These notes should help you get started with the statistics software R, guide you through the practical part of the course, and provide additional information on selected topics. Before you continue reading, make sure you have R and RStudio installed on your computer. Try to use R as a calculator as shown in the lecture slides and try to get accustomed to the different windows in the RStudio environment. No prior knowledge about R is necessary, and once everything is up and running you can dive right in.
We start by loading some relevant packages for this course. An R package is a collection of functions, data, and documentation that extends the capabilities of base R.
# make sure you have the packages installed using
# install.packages("packagename")
library(tidyverse)
library(Rmisc)
library(tibble)
library(dplyr)
# here I define a color palette to use as the standard when plotting with ggplot
# this is purely for aesthetic reasons
cbp2 <- c("#000000", "#E69F00", "#56B4E9", "#009E73",
"#F0E442", "#0072B2", "#D55E00", "#CC79A7")
scale_colour_discrete <- function(...) {
scale_colour_manual(..., values = cbp2)
}
We start with some examples of how to visualize data quickly and easily. This should show you how to produce high-quality visualizations using R and motivate you to learn more about programming and data analysis with R.
R has several systems for making graphs, and I have chosen to start with ggplot2 because it is elegant, versatile, and easy to learn if you have no prior knowledge about programming. For those of you who are already familiar with programming languages, doing graphics in “base R” may seem more intuitive initially, but I hope I can convince you that the grammar of plotting used by ggplot has many advantages and is very readable once you get the hang of it.
If you would like to learn more about the theoretical underpinnings of ggplot2 before you start, I would recommend reading “The Layered Grammar of Graphics”, http://vita.had.co.nz/papers/layered-grammar.pdf.
Let us use our first graph to answer a question: Do cars with big engines use more fuel than cars with small engines? What does the relationship between engine size and fuel efficiency look like? Is it positive? Negative? Linear? Nonlinear?
We can find an answer with the mpg data set found in ggplot2 (aka ggplot2::mpg). A data frame is R's format for storing data as in a spreadsheet: a rectangular collection of variables (in the columns) and observations (in the rows). The data frame mpg contains observations collected by the US Environmental Protection Agency on 38 models of car.
# the command head() shows us the first few rows of the data set to
# give an overview of what is stored in it
head(mpg)
## # A tibble: 6 x 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compa…
## 2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compa…
## 3 audi a4 2 2008 4 manual(m6) f 20 31 p compa…
## 4 audi a4 2 2008 4 auto(av) f 21 30 p compa…
## 5 audi a4 2.8 1999 6 auto(l5) f 16 26 p compa…
## 6 audi a4 2.8 1999 6 manual(m5) f 18 26 p compa…
Among the variables in mpg are:
displ, a car’s engine size, in litres.
hwy, a car’s fuel efficiency on the highway, in miles per gallon (mpg). A car with a low fuel efficiency consumes more fuel than a car with a high fuel efficiency when they travel the same distance.
To learn more about mpg, open its help page by running ?mpg. In the next section we learn how to visualize such data.
In general you begin a plot with the function ggplot(). ggplot() creates a coordinate system that you can add layers to. The first argument of ggplot() is the dataset to use in the graph. For instance, typing ggplot(data = mpg) into the console creates an empty graph (which is boring so it is not shown here). You can now add layers to your plot. For example, the function geom_point() adds a layer of points to your plot, thus creating a scatterplot.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
There are many other such geometric functions within the ggplot2 universe, and you can combine them in a single plot. You will learn a whole bunch of them throughout this course, but be aware that we will only scratch the surface of what is available. After this course, you should be ready to equip yourself with new tools for data analysis whenever you need them.
Each geom function in ggplot2 takes a mapping argument. This defines how variables in your dataset are mapped to visual properties. The mapping argument is always paired with aes(), and the x and y arguments of aes() specify which variables to map to the x and y axes. ggplot2 looks for the mapped variables in the data argument, in this case, mpg.
Next, we use a different theme, because the gray background in the standard design is ugly.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
theme_classic()
The general grammar of ggplot is
ggplot(data = <DATA>) +
<GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))
We will explain each part step by step in examples. The DATA argument should be quite clear: here you specify the data frame (we will learn more about that later) that you want to work with.
The second part, GEOM_FUNCTION, specifies what kind of plot you want. As mentioned above, geom_point gives you a classic scatter plot. Many other such functions exist and we will encounter some of them.
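For instance, one of these other functions, geom_smooth from ggplot2, draws a smoothed trend line through the data instead of individual points (a small illustration):

```r
# geom_smooth fits a smoothed curve to the data and draws it
# together with a confidence band around the fit
ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy))
```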
Finally, MAPPING specifies which variables should be plotted. We usually specify this with an aesthetics mapping of the form mapping = aes(x = …, y = …) to determine the variables that give us the x- and y-coordinates. We can, however, also change the aesthetics of our plot in various ways, for instance by giving data points different colors that indicate the value of a third variable:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy,color = class)) +
theme_classic()
Another important option is to set a color manually:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy), color = "blue") +
theme_classic()
As a final remark, you can also create a plot, store it in a variable, and then explore different themes after that:
p = ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy), color = "blue")
p + theme_classic()
p + theme_dark()
We now set our preferred theme globally, so we do not need to specify it every time:
theme_set(theme_classic())
Base R refers to the standard R functions. While ggplot allows us to make sophisticated graphs very quickly, I find that in some situations base R is just as good or even better. Let us recreate the same plot as before:
plot(mpg$displ,mpg$hwy)
Note how we use the $ symbol to access the variables stored in a data frame here; it allows you to access a specific column of a data frame. Try typing the name of a data frame, followed by the $ operator in RStudio; it should automatically suggest the available variable names in your data frame.
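As a small illustration, the $ operator pulls out a single column as a plain vector (the values below match the head(mpg) output shown earlier):

```r
# the first few highway fuel-efficiency values, as a numeric vector
head(mpg$hwy)
## [1] 29 29 31 30 26 26
```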
We can also add elements to this plot, change colors, and adjust pretty much everything else we would like to. A quick look at the help page should give you an overview of the scope of the function plot and its parameters. For instance, have a look at the following plot, where I make a scatterplot, modify the appearance of the symbols, add some labels and a title, and then add a horizontal line at the value of the median fuel efficiency:
plot(mpg$displ,mpg$hwy,pch=16,col="darkblue",xlab="displ",ylab="hwy", main ="a scatterplot")
abline(h = median(mpg$hwy))
When working with data frames, ggplot may seem more attractive and easier to use than base R functions, simply because many steps are automated. You may want to try color-coding the different points according to the variable class, just as we did before. Then try to add a legend (look up the function legend in the R help). You will find that it can be a bit tedious to put all these things together. However, it also gives you a lot of control over the final look of your figure once you have figured out what you want and how to get it. Another advantage of base R functions is that you are not restricted to the data frame format, which often makes things easier. For instance, consider this plot:
x=seq(0,7,by=0.1)
# x = (0,0.1,0.2,0.3, ... ,6.9,7)
y=sin(x)
plot(x,y,lwd=2,type="l",col="deepskyblue4",axes=FALSE)
#plot without axes
axis(1) #add axes to plot
axis(2)
text(2.8,0.85,"y = sin(x)",col="deepskyblue4")
# add some text at coordinates x = 2.8, y = 0.85
lines(x,cos(x),lwd=2,col="green4") # add a cosine plot
text(1.5,-0.5,"y=cos(x)",col="green4") # and label it
legend(3.5,0.8,legend=c("y=sin(x)","y=cos(x)"),
col=c("deepskyblue4","green4"),lwd=2)
# add a legend to the plot
Instead of using a data frame, we have created vectors of variables and used them in several different ways. You will explore this more in the exercise sessions.
By now you should see that R is very versatile and can be used for much more than simply doing statistics. This course will teach you the basics of R so that you can start using it for your research projects. An essential part is that you should also learn how to find and use additional resources, such as the help function in R, books, the internet, etc. Nobody knows all of R - it is an ever-growing community endeavour, and looking up how things work, which functions or packages are available, or how other people solved a problem is an essential part of learning to be proficient in R.
How do we feed R our data? Of course, we can just type it into the console, a bit like in the previous base R graphics example. For large data sets (or rather anything that is not an extremely small data set) this is not practical. We want to load our data from a file, which makes our analysis easily reproducible if we share data and R scripts with someone. First we need to make sure R knows where our working directory is, using the function setwd (set working directory):
setwd("/Users/stephan/Dropbox/Teaching/BlogdownPages/StatisticsForBiology")
If you want to check your current working directory, you can do this with getwd().
I use a dataset about the number of plates on sticklebacks. You can find this dataset (and many others) on ilias. We can load it using the function read.csv and store it in the variable stickleback:
stickleback = read.csv("chap03e3SticklebackPlates.csv")
The variable stickleback now contains a so-called data frame. We can get a quick summary of the content of this data frame using the command str (short for STRucture):
str(stickleback)
## 'data.frame': 344 obs. of 3 variables:
## $ id : Factor w/ 344 levels "4-1","4-10","4-100",..: 1 102 282 293 2 22 42 131 212 280 ...
## $ plates : int 11 63 22 10 14 11 58 36 31 61 ...
## $ genotype: Factor w/ 3 levels "mm","Mm","MM": 1 2 2 2 1 1 2 2 2 2 ...
We can see from the output that we have 344 observations (= sample size) and that for each individual we have measured 3 variables (id, plates and genotype). id and genotype are so-called factors, that is, categorical variables, while plates is of type integer.
We used the function read.csv because our data is stored in the CSV format (short for comma-separated values - open the file in a text editor to see what that means). CSV is a common format and you can export your data from software like Excel in that format. Of course, R provides functions for other formats as well, or you can use the read.table() function, where you can specify how the data should be read and transformed into a data frame.
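For example, the read.csv call from above could equivalently be written with read.table by spelling out the two arguments that read.csv presets for you (a sketch, using the same file as before):

```r
# equivalent to read.csv("chap03e3SticklebackPlates.csv"):
# sep = ","      -> values are separated by commas
# header = TRUE  -> the first line contains the variable names
stickleback = read.table("chap03e3SticklebackPlates.csv",
                         sep = ",", header = TRUE)
```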
Let us use our ggplot skills to create a nice figure that visualizes our data set. For this we introduce the function geom_bar(), which allows us to make a bar plot of the number of plates per individual, colored by genotype:
ggplot(data = stickleback) +
geom_bar(mapping = aes(x = plates,color=genotype)) +
theme_classic()
This figure is a bit messy. A better way would be to show the distribution for each genotype in its own window. This can be done using the function facet_wrap. As before, we can simply add this to our ggplot command:
ggplot(data = stickleback) +
geom_bar(mapping = aes(x = plates),fill="darkred") +
facet_wrap(~genotype,nrow = 3) +
theme_classic()
We used a new symbol here: the ~. It is part of an R object called a formula, and it will later become clear why we used it here like this. For now, just remember to use it in front of the variable name in facet_wrap. The second parameter is simply the number of rows in the plot. Try changing it and see what happens. Note: the variable that you pass to facet_wrap should always be categorical!
A data frame is essentially a spreadsheet or a table (or rather a matrix). Therefore, we can access each column or each row if we want to. This can be done using the [,] symbols. For instance, the element in row 3 and column 2 is given by:
stickleback[3,2]
## [1] 22
The whole 3rd column can be accessed by
stickleback[,3]
## [1] mm Mm Mm Mm mm mm Mm Mm Mm Mm Mm mm mm mm Mm Mm mm Mm Mm MM Mm Mm mm Mm MM
## [26] Mm mm MM Mm Mm MM mm Mm Mm mm Mm mm MM Mm mm Mm MM mm Mm Mm MM Mm mm mm Mm
## [51] Mm MM mm mm mm Mm MM MM Mm MM MM Mm MM Mm mm mm MM Mm Mm MM MM Mm MM Mm MM
## [76] MM Mm Mm Mm mm Mm Mm Mm Mm MM Mm Mm Mm Mm MM mm Mm Mm mm MM Mm mm MM mm mm
## [101] Mm mm mm Mm MM Mm mm mm Mm Mm MM MM Mm mm Mm MM MM Mm mm Mm Mm MM mm MM mm
## [126] mm Mm Mm Mm mm Mm mm Mm mm Mm MM Mm Mm Mm Mm MM Mm MM Mm Mm Mm mm MM Mm MM
## [151] MM MM Mm Mm mm mm Mm MM MM Mm Mm mm Mm Mm mm Mm mm Mm Mm MM Mm MM Mm MM MM
## [176] Mm mm MM mm Mm Mm Mm Mm Mm Mm Mm mm Mm MM Mm Mm MM mm MM Mm mm Mm Mm Mm Mm
## [201] Mm Mm Mm Mm Mm Mm MM MM Mm Mm mm Mm mm mm Mm MM Mm Mm Mm Mm mm mm Mm Mm Mm
## [226] mm MM MM MM Mm Mm mm MM MM MM MM Mm Mm MM Mm MM mm mm Mm Mm Mm MM mm mm MM
## [251] mm Mm MM Mm Mm mm mm Mm Mm MM mm Mm Mm mm Mm Mm Mm Mm mm MM Mm mm Mm MM Mm
## [276] mm Mm Mm MM mm mm Mm MM Mm Mm mm MM mm Mm Mm Mm MM mm MM Mm Mm mm MM MM mm
## [301] MM MM MM Mm Mm MM mm mm MM MM mm Mm Mm Mm Mm MM mm Mm Mm Mm Mm MM Mm mm Mm
## [326] Mm Mm Mm Mm Mm mm MM mm mm Mm Mm Mm mm Mm mm MM mm Mm mm
## Levels: mm Mm MM
and the 5th row by
stickleback[5,]
## id plates genotype
## 5 4-10 14 mm
Note that stickleback[[3]] gives you all measurements of the 3rd variable in our data frame and is hence equivalent to stickleback$genotype. The command stickleback[5,], on the other hand, gives you a one-row data frame with the 3 measured values of the 5th individual.
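You can convince yourself of this equivalence directly in the console:

```r
# both expressions extract the same genotype column as a vector
identical(stickleback[[3]], stickleback$genotype)
## [1] TRUE
```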
We can also use this way of accessing elements to find elements of a certain type. For instance, if we want to know the genotype of all individuals with more than 20 plates, we can get this in the following way:
stickleback[stickleback$plates>20,3]
Try to understand what is happening here by teasing apart and investigating the different parts of this (nested) command. What would happen if you removed stickleback$plates>20 from the command, or the 3?
A slightly more elegant way to get this is by using the function filter (make sure you have the package dplyr loaded!):
filter(stickleback,plates>20)
which gives you a data frame with only the individuals that satisfy the condition given in the argument.
Sometimes we wish to extend our data frame with calculations we did during our analysis or with additional measurements. This can be done with the function mutate(). It simply adds new columns at the end of your data set. Let's say we want to identify which individuals have fewer or more plates compared to the global mean:
diff.to.mean = stickleback$plates - mean(stickleback$plates)
stickleback2 = mutate(stickleback,difference = diff.to.mean)
head(stickleback2)
## id plates genotype difference
## 1 4-1 11 mm -32.43314
## 2 4-2 63 Mm 19.56686
## 3 4-4 22 Mm -21.43314
## 4 4-5 10 Mm -33.43314
## 5 4-10 14 mm -29.43314
## 6 4-12 11 mm -32.43314
We have already seen how to look at a subset of your data frame using indexing. Here is another, slightly more elegant way of doing this, especially when many conditions are combined over multiple variables. For instance, what if we want to consider only stickleback with more than 20 plates:
head(subset(stickleback, plates>20))
## id plates genotype
## 2 4-2 63 Mm
## 3 4-4 22 Mm
## 7 4-14 58 Mm
## 8 4-23 36 Mm
## 9 4-31 31 Mm
## 10 4-38 61 Mm
We could now compare how the distribution of genotypes changes if we condition on more than 20 plates:
ggplot(data = subset(stickleback, plates>20)) +
geom_bar(mapping= aes(x = genotype)) +
labs(title="Stickleback with more than 20 plates",
x ="Genotype", y = "Count")
ggplot(data = stickleback) +
geom_bar(mapping= aes(x = genotype)) +
labs(title="All Stickleback",
x ="Genotype", y = "Count")
There are a few mm genotypes with more than 20 plates. But how many?
subset(stickleback,plates > 20 & genotype == "mm")
## id plates genotype
## 100 4-25 37 mm
## 132 4-87 22 mm
It turns out there are exactly two: one with 37 plates and one with 22. We have used logical operators here: “&” means that both conditions need to be satisfied. In contrast, try to see what the symbol " | " means:
head(subset(stickleback,plates > 20 | genotype == "mm"))
## id plates genotype
## 1 4-1 11 mm
## 2 4-2 63 Mm
## 3 4-4 22 Mm
## 5 4-10 14 mm
## 6 4-12 11 mm
## 7 4-14 58 Mm
You may have guessed it already: " | " represents the logical “or”, indicating that either of the two conditions needs to be satisfied. Note that for logical comparisons we use the “==” and not the “=” symbol, which is reserved for assignments. What is the difference? Try it:
a = 5
a == 5
## [1] TRUE
a == 6
## [1] FALSE
We can also use the symbols < (smaller), > (larger), <= (smaller or equal), >= (larger or equal), != (not equal) and combine them into logical statements.
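For example, these comparison operators can be combined with & and | into a single logical statement:

```r
a = 5
a != 6           # a is not equal to 6
## [1] TRUE
a >= 5 & a < 10  # both conditions hold, so the combination is TRUE
## [1] TRUE
```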
Add a new variable to the stickleback data frame that contains the difference between plate number and the mean plate number for that individual's genotype.
Load the flights dataset:
library(nycflights13)
Find all flights that
Had an arrival delay of two or more hours
Flew to Houston (IAH or HOU)
Were operated by United, American, or Delta
Departed in summer (July, August, and September)
Arrived more than two hours late, but didn’t leave late
Were delayed by at least an hour, but made up over 30 minutes in flight
Departed between midnight and 6am
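As a hint for the first condition (a sketch; delays in the flights data set are recorded in minutes), you can combine filter with a comparison, just as we did with the stickleback data:

```r
library(nycflights13)
library(dplyr)
# flights that arrived two or more hours (120 minutes) late
filter(flights, arr_delay >= 120)
```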
We continue using the stickleback data set and calculate some statistics on it. A simple way to get a quick overview of the properties of your data frame is the function summary:
summary(stickleback)
## id plates genotype
## 4-1 : 1 Min. : 6.00 mm: 88
## 4-10 : 1 1st Qu.:14.00 Mm:174
## 4-100 : 1 Median :57.00 MM: 82
## 4-101 : 1 Mean :43.43
## 4-103 : 1 3rd Qu.:62.00
## 4-105 : 1 Max. :69.00
## (Other):338
If you want to calculate a specific statistic, like the mean of a variable, this can be done easily:
mean(stickleback$plates)
## [1] 43.43314
You may have noticed that we have now calculated the mean number of plates for ALL fish in our data set. What if we want to calculate the mean number of plates for a specific genotype? We can use the function filter:
stickleback.mm = filter(stickleback,genotype=="mm")
mean(stickleback.mm$plates)
## [1] 11.67045
You could now calculate the mean plate numbers for each genotype in this way:
mean.mm = mean((filter(stickleback,genotype=="mm"))$plates)
mean.Mm = mean((filter(stickleback,genotype=="Mm"))$plates)
mean.MM = mean((filter(stickleback,genotype=="MM"))$plates)
Note: The tidyverse offers an elegant way to write a long series of commands in a very readable format. To show you how to combine multiple commands quickly, we first need two more functions, group_by and summarise, and a new operator called the pipe: %>% (because we create a pipeline through which our data “flows”). The above code to calculate the mean for each genotype would become:
stickleback %>%
group_by(genotype) %>%
dplyr::summarise(avg = mean(plates))
## # A tibble: 3 x 2
##   genotype   avg
##   <fct>    <dbl>
## 1 mm        11.7
## 2 Mm        50.4
## 3 MM        62.8
This is very readable: you take the data frame stickleback, group it by genotype, and summarize it by calculating the mean number of plates. Note that we write dplyr::summarise explicitly here: the Rmisc package loaded at the beginning pulls in plyr, whose own summarise function would otherwise mask the dplyr version and silently ignore the grouping. However, errors might be more difficult to spot when the code is written in this way, as intermediate steps cannot be checked.
In the exercises you will learn a few more useful functions to rearrange, filter, extend and edit data frames.
Practice Question: Can you come up with another way to do this without using filter?
Two possible answers (there are many ways to do this): Straightforward “indexing”:
mean.mm = mean(stickleback$plates[stickleback$genotype=="mm"])
mean.Mm = mean(stickleback$plates[stickleback$genotype=="Mm"])
mean.MM = mean(stickleback$plates[stickleback$genotype=="MM"])
or by using the function tapply (look up the help page for this function by typing “?tapply”):
mean.plates = tapply(stickleback$plates,stickleback$genotype,mean)
We can do a boxplot easily using ggplot:
ggplot(data = stickleback) +
geom_boxplot(mapping = aes(x = genotype, y = plates))
In base R, this can be done as well:
boxplot(stickleback$plates[stickleback$genotype=="MM"],
stickleback$plates[stickleback$genotype=="Mm"],
stickleback$plates[stickleback$genotype=="mm"],
names = c("MM","Mm","mm")
)
Practice Question: Can you plot a histogram of plate numbers of individuals with genotype mm using base R plotting functions? Hint: Use what you have learned here and look up the function hist(). Alternatively, use ggplot and geom_histogram.
Answer:
hist(stickleback$plates[stickleback$genotype=="mm"],col="darkred",xlab="plates",main="Genotype: mm")
ggplot(data=filter(stickleback,genotype == "mm")) +
geom_histogram(mapping = aes(x = plates))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
In this part you will see how to explore a data set and store your analysis in a reproducible and reusable way.
So far we have only used the R console: we enter commands and R executes them. Most of the time we want to have a whole bunch of commands stored in a single place: loading the file, preparing and cleaning the data, making figures, etc. The solution is easy: we just store our commands in a text file with the extension .R and tell R to execute this script. Even better, RStudio has its own window panel for writing and executing scripts! Try to open a new R script (via File -> New -> R script) and just write your code in the newly opened window. When you have written your code, you can either execute it step by step by putting the cursor on the line you want to execute and hitting the Run button, or execute the whole script from beginning to end by hitting the Source button. This allows you to write reproducible scripts that you can modify and reuse whenever you need to. This is probably the biggest advantage of R. Once you have solved a problem and written a script for it, you can always reuse it. The more you use R, the less work will be necessary for each new project!
In the next part I will introduce the basic tools for creating an automated analysis that consists of several steps.
Defining variables is really easy in R. Unlike in other programming languages, we do not need to declare precisely what kind of variable we want. We can just write down a name and assign a value; R figures out the rest. For instance:
x = 5
automatically creates a numeric variable and assigns the value 5 to it. This command
vec = c(1,2,3,4,5,6)
creates a vector that contains the values 1 to 6. The c here stands for combine (often read as concatenate) and tells R to combine all the numbers into a single vector. You can access the elements of the vector by their index; e.g., we can set the 5th element of the vector to 0 in the following way:
vec[5] = 0
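To see the effect, print the vector after the assignment; you can also select several elements at once with an index range (a quick illustration):

```r
vec = c(1,2,3,4,5,6)
vec[5] = 0  # replace the 5th element
vec
## [1] 1 2 3 4 0 6
vec[2:4]    # extract elements 2 to 4
## [1] 2 3 4
```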
In general it is advisable to define the type of a variable before using it, and also to tell R how big this variable will be (that is, how much memory needs to be reserved to store it). For instance, if we want to define a matrix with 4 rows and 3 columns, we can use the command matrix:
mat=matrix(ncol=3,nrow=4)
You can then enter numbers in the rows and columns using the [] operator:
mat
## [,1] [,2] [,3]
## [1,] NA NA NA
## [2,] NA NA NA
## [3,] NA NA NA
## [4,] NA NA NA
mat[2,3]= 5
mat
## [,1] [,2] [,3]
## [1,] NA NA NA
## [2,] NA NA 5
## [3,] NA NA NA
## [4,] NA NA NA
You can see that the matrix is initially filled with the object NA, which stands for “not available” and is the term R uses for missing data. Just like in data frames (after all, a data frame is more or less a matrix with benefits), you can also name columns and rows and use them to access the elements:
mat = matrix(1:25,nrow=5,ncol=5)
colnames(mat) = c("var1","var2","var3","var4","var5")
rownames(mat) = c("obs1","obs2","obs3","obs4","obs5")
mat
## var1 var2 var3 var4 var5
## obs1 1 6 11 16 21
## obs2 2 7 12 17 22
## obs3 3 8 13 18 23
## obs4 4 9 14 19 24
## obs5 5 10 15 20 25
mat["obs2",]
## var1 var2 var3 var4 var5
## 2 7 12 17 22
mat[,"var3"]
## obs1 obs2 obs3 obs4 obs5
## 11 12 13 14 15
mat["obs2","var3"]
## [1] 12
If we want to store some text, we can write
word = "hello"
and R will automatically recognize that the string of letters “hello” is not numeric and will create a variable of type character. Try to see how this is different from
letters = c("h","e","l","l","o")
Next try to see what happens here:
hello = 10
word1 = "hello"
word2 = hello
print(word1)
## [1] "hello"
print(word2)
## [1] 10
Before, we saw how to calculate the mean number of plates for each genotype. Essentially we did the same thing 3 times. Imagine we had hundreds of genotypes. In such situations loops come in handy. They allow us to automate a process that needs to be repeated many times. We only introduce the so-called for-loop here, which will allow you to do a lot of things. Once you have understood how it works, it will be easy to figure out how a while-loop works and what the differences are.
Here is a simple for loop that calcualtes the mean number of plates for each genotpye in our stickleback data set:
# first we create a vector that contains all genotypes in our dataset
# we can do this manually:
# genotypes = c("mm","Mm","MM")
# or by using the function levels()
# that gives us a vector with all the different entries found in a
# vector of type factor
genotypes = levels(stickleback$genotype)
# next, we define a vector that has the same length as the number
# of different genotypes and name the elements accordingly
plate.numbers = vector("numeric",length(genotypes))
names(plate.numbers)=genotypes
# now comes the actual loop:
for(g in genotypes)
{
plate.numbers[g]=mean(stickleback$plates[stickleback$genotype==g])
}
We can see what is happening here by showing the intermediate steps more explicitly:
i = 1
for(g in genotypes)
{
print(paste("This is iteration ",i," of our for-loop"))
print(paste("Genotype ",i," = ",g))
mean.temp = mean(stickleback$plates[stickleback$genotype==g])
print(paste("The mean number of plates of genotype ",g," is ",mean.temp))
plate.numbers[g]=mean.temp
i = i+1
}
## [1] "This is iteration 1 of our for-loop"
## [1] "Genotype 1 = mm"
## [1] "The mean number of plates of genotype mm is 11.6704545454545"
## [1] "This is iteration 2 of our for-loop"
## [1] "Genotype 2 = Mm"
## [1] "The mean number of plates of genotype Mm is 50.3793103448276"
## [1] "This is iteration 3 of our for-loop"
## [1] "Genotype 3 = MM"
## [1] "The mean number of plates of genotype MM is 62.780487804878"
There are three basic steps here: we create a vector with the group labels, we prepare a result vector of the right length, and we loop over the labels to fill in the results.
Question: Could you come up with an alternative way to solve this problem using a loop?
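One possible answer (a sketch using the while-loop mentioned above, which repeats its body as long as the condition in parentheses is TRUE; here we loop over the positions instead of the labels):

```r
genotypes = levels(stickleback$genotype)
plate.numbers = vector("numeric", length(genotypes))
names(plate.numbers) = genotypes
i = 1
# keep going until every genotype has been processed
while (i <= length(genotypes)) {
  g = genotypes[i]
  plate.numbers[g] = mean(stickleback$plates[stickleback$genotype == g])
  i = i + 1
}
```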
Another useful tool are if/else constructs. They allow you to make choices within your script. Let us again consider the stickleback data set. Let's say we want to calculate the difference between the number of plates of an individual and the mean of that individual's genotype. We can do this by combining loops with if/else statements (admittedly, there are better ways to achieve this, but it is a good instructive example):
number.obs = dim(stickleback)[1]
diff.to.mean = vector("numeric",length=number.obs)
for (i in 1:number.obs)
{
if(stickleback$genotype[i] == "mm")
diff.to.mean[i] = stickleback$plates[i] - mean.mm
if(stickleback$genotype[i] == "Mm")
diff.to.mean[i] = stickleback$plates[i] - mean.Mm
if(stickleback$genotype[i] == "MM")
diff.to.mean[i] = stickleback$plates[i] - mean.MM
}
stickleback3= mutate(stickleback2,diff.per.genotype = diff.to.mean)
head(stickleback3)
## id plates genotype difference diff.per.genotype
## 1 4-1 11 mm -32.43314 -0.6704545
## 2 4-2 63 Mm 19.56686 12.6206897
## 3 4-4 22 Mm -21.43314 -28.3793103
## 4 4-5 10 Mm -33.43314 -40.3793103
## 5 4-10 14 mm -29.43314 2.3295455
## 6 4-12 11 mm -32.43314 -0.6704545
R has a set of functions to work with distributions and random numbers. Have a look at
help(Distributions)
if you want to know more. Let us focus on the normal distribution. Its density is given by \[f(x \mid \mu,\sigma) = \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}\] The function dnorm returns the value of the probability density function of the normal distribution at x, given parameters \(\mu\) and \(\sigma\). Some examples of using dnorm are below:
# We use the pdf of the normal with x = 0,
# mu = 0 and sigma = 1.
# The dnorm function takes three main arguments,
# as do all of the *norm functions in R.
dnorm(0, mean = 0, sd = 1)
## [1] 0.3989423
# Another example of dnorm where the parameters have been changed.
dnorm(2, mean = 5, sd = 3)
## [1] 0.08065691
Let us make a plot showing the density function of the normal distribution. We use base R, as ggplot works best with data frames and here we prefer to work with vectors instead.
# The plot should show x values on the x-axis and the dnorm(x,...)
# values on the y-axis
# First I'll make a vector of x values
x_vals <- seq(-3, 3, by = .1)
# Let's have a look
x_vals
## [1] -3.0 -2.9 -2.8 -2.7 -2.6 -2.5 -2.4 -2.3 -2.2 -2.1 -2.0 -1.9 -1.8 -1.7 -1.6
## [16] -1.5 -1.4 -1.3 -1.2 -1.1 -1.0 -0.9 -0.8 -0.7 -0.6 -0.5 -0.4 -0.3 -0.2 -0.1
## [31] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4
## [46] 1.5 1.6 1.7 1.8 1.9 2.0 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9
## [61] 3.0
# The y values are then given by dnorm
y_vals <- dnorm(x_vals,0,1)
# Now we'll plot these values
plot(x_vals,y_vals,
type = "l", # Make it a line plot
main = "pdf of the Standard Normal",
xlab= "x",ylab="density")
Of course, you could create a data frame with these values and then use ggplot if you want to:
std_norm = data.frame(x= x_vals,y=y_vals)
ggplot(data = std_norm) +
geom_line(mapping = aes(x = x,y=y))
The other three most relevant functions are: pnorm (cumulative distribution function), qnorm (quantiles) and rnorm (draw random numbers). The use of pnorm is completely analogous to dnorm, so we skip it here and focus on the more interesting qnorm and rnorm.
The function qnorm gives us the quantiles of the normal distribution:
# Let's make a vector of probabilities:
# from 0 to 1 in increments of .05
quantiles <- seq(0, 1, by = .05)
quantiles
## [1] 0.00 0.05 0.10 0.15 0.20 0.25 0.30 0.35 0.40 0.45 0.50 0.55 0.60 0.65 0.70
## [16] 0.75 0.80 0.85 0.90 0.95 1.00
# Now we'll find the value for each of those quantiles.
# Remember that the q-quantile is the value Q such that
# P(X <= Q) = q
# In other words, the quantile function is the inverse of the CDF
qvalues <- qnorm(quantiles)
qvalues
## [1] -Inf -1.6448536 -1.2815516 -1.0364334 -0.8416212 -0.6744898
## [7] -0.5244005 -0.3853205 -0.2533471 -0.1256613 0.0000000 0.1256613
## [13] 0.2533471 0.3853205 0.5244005 0.6744898 0.8416212 1.0364334
## [19] 1.2815516 1.6448536 Inf
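Since qnorm is the inverse of the cumulative distribution function pnorm, applying one after the other returns the original probability (a quick sanity check):

```r
# pnorm gives P(X <= x); qnorm undoes it
pnorm(qnorm(0.975))
## [1] 0.975
qnorm(0.975)  # the familiar 97.5% quantile of the standard normal
## [1] 1.959964
```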
We could now plot the PDF of a normal distribution and illustrate a few of the concepts:
plot(x_vals,y_vals,
type = "l", # Make it a line plot
main = "pdf of the Standard Normal",
xlab= "x",ylab="density")
# the 50 % quantile is just the median:
abline(v = qnorm(0.5),col="black",lwd=2,lty=1)
# the 2.5% and the 97.5% quantiles mark the region containing
# the central 95% of the distribution:
abline(v = qnorm(c(0.025,0.975)),col="darkred",lwd=2,lty=2)
text(-2.5,0.3,"qnorm(0.025)",col="darkred")
# In contrast, the function dnorm gives us the value of the PDF
# for a specific x value, that is, the height of the function f(x)
# at the point x
abline(h = dnorm(-1),col="blue",lwd=2,lty=1)
abline(v = -1,lty=2)
text(-1.4,0.25,"dnorm(-1)",col="blue")
R also allows us to draw random numbers. This is very handy in many situations. The parameters of rnorm are
the number of observations
the mean of the distribution
the standard deviation of the distribution
Here is an example, where we generate samples of 10, 100 and 10,000 observations from the same distribution.
# Let's generate three different vectors of random numbers
# from a normal distribution
n10 <- rnorm(10, mean = 70, sd = 5)
n100 <- rnorm(100, mean = 70, sd = 5)
n10000 <- rnorm(10000, mean = 70, sd = 5)
# Let's just look at one of the vectors
n10
## [1] 69.36258 74.97557 71.78788 72.38635 73.50610 63.29977 63.64326 70.92737
## [9] 73.15134 73.11582
# The breaks argument specifies how many bars are in the histogram
hist(n10, breaks = 5)
hist(n100, breaks = 20)
hist(n10000, breaks = 100)
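With many observations the histogram should approach the theoretical bell curve. As a small sketch (setting a seed so the result is reproducible, and using freq = FALSE to put the histogram on the density scale):

```r
set.seed(1)  # for reproducibility
n10000 <- rnorm(10000, mean = 70, sd = 5)
hist(n10000, breaks = 100, freq = FALSE)
# overlay the density of a Normal(70, 5) distribution
curve(dnorm(x, mean = 70, sd = 5), add = TRUE, col = "darkred", lwd = 2)
```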
Note: If we set a seed for our random number generator (the algorithm that is used to calculate a random number), we can make our analysis replicable:
set.seed(100)
rnorm(1,0,1)
## [1] -0.5021924
rnorm(1,0,1)
## [1] 0.1315312
set.seed(100)
rnorm(1,0,1)
## [1] -0.5021924
rnorm(1,0,1)
## [1] 0.1315312
We first generate a data frame with a summary of our data (this can be done in many ways; this is a concise one):
CIs <- summarySE(stickleback, measurevar="plates", groupvars=c("genotype"))
CIs
## genotype N plates sd se ci
## 1 mm 88 11.67045 3.567805 0.3803293 0.7559456
## 2 Mm 174 50.37931 15.146866 1.1482809 2.2664440
## 3 MM 82 62.78049 3.410313 0.3766060 0.7493279
and then we use the data frame with our summary to add error bars to our plot:
ggplot(data = CIs) +
geom_point(mapping = aes(x = genotype,y=plates)) +
geom_errorbar(mapping = aes(x = genotype,y=plates,ymin=plates-ci, ymax=plates+ci), width=.1)
Note: One can shorten things by specifying the aesthetics in the ggplot command - this simply means that all geom_functions use the same x and y variables (which is very often the case, but not always!):
ggplot(data = CIs, aes(x = genotype,y=plates)) +
geom_point() +
geom_errorbar(mapping = aes(ymin=plates-ci, ymax=plates+ci), width=.1)
Note: The function t.test() can also be used to calculate confidence intervals assuming a normal distribution.
genotypes = c("MM","Mm","mm")
CI.upper = vector("numeric",3)
CI.lower = vector("numeric",3)
means = vector("numeric",3)
i = 1
for(g in genotypes)
{
results = t.test(stickleback$plates[stickleback$genotype == g])
CI.lower[i] = results$conf.int[1]
CI.upper[i]= results$conf.int[2]
means[i] = results$estimate
i = i + 1
}
CI.data = data.frame(genotype = genotypes,mean = means,CI.lower = CI.lower,CI.upper = CI.upper)
ggplot(data = CI.data, aes(x = genotype,y=mean)) +
geom_point() +
geom_errorbar(mapping = aes(ymin=CI.lower, ymax=CI.upper), width=.1)
Copy and paste this R code into a script. Before you run it, try to figure out what it does.
reps = 1000
sample_size = 10
mu = 10
sigma = 5
means = vector("numeric",reps)
for (i in 1:reps)
{
sample = rnorm(sample_size,mu,sigma)
means[i] = mean(sample)
}
hist(means,freq=F,main="A distribution")
x = seq(mu-2*sigma,mu+2*sigma,by=0.01)
SE = sigma/sqrt(sample_size)
y = dnorm(x,mu,SE)
lines(x,y)
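Once you have worked out what the script does: the histogram shows the sampling distribution of the mean, and its spread should match the standard error \(\sigma/\sqrt{n}\). A quick sketch to check this numerically (using replicate instead of the explicit loop):

```r
set.seed(42)  # for reproducibility
reps <- 1000
sample_size <- 10
mu <- 10
sigma <- 5
# simulate the sampling distribution of the mean
means <- replicate(reps, mean(rnorm(sample_size, mu, sigma)))
# the standard deviation of the simulated means should be
# close to the theoretical standard error sigma / sqrt(n)
sd(means)
sigma / sqrt(sample_size)  # = 1.581
```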
In this chapter we will learn how to do hypothesis tests in R. This is very simple, as you will see - often a single line of code is sufficient for conducting a test.
We use the binomial test to test whether spermatogenesis genes in the mouse genome occur with unusual frequency on the X chromosome. First, we load the data set and inspect it:
mouseGenes <- read_csv("~/Dropbox/Teaching/BlogdownPages/StatisticsForBiology/chap07e2SexAndX.csv")
head(mouseGenes)
## # A tibble: 6 x 2
## chromosome onX
## <chr> <chr>
## 1 4 no
## 2 4 no
## 3 6 no
## 4 6 no
## 5 6 no
## 6 7 no
# Tabulate the number of spermatogenesis genes on the
# X-chromosome and the number not on the X-chromosome.
table(mouseGenes$onX)
##
## no yes
## 15 10
# Calculate the binomial probabilities of all possible outcomes
# under the null hypothesis. Under the binomial
# distribution with n = 25 and p = 0.061, the number of successes
# can be any integer between 0 and 25.
xsuccesses <- 0:25
probx <- dbinom(xsuccesses, size = 25, prob = 0.061)
data.frame(xsuccesses, probx)
## xsuccesses probx
## 1 0 2.073193e-01
## 2 1 3.367007e-01
## 3 2 2.624760e-01
## 4 3 1.307255e-01
## 5 4 4.670757e-02
## 6 5 1.274386e-02
## 7 6 2.759585e-03
## 8 7 4.865905e-04
## 9 8 7.112305e-05
## 10 9 8.727323e-06
## 11 10 9.071211e-07
## 12 11 8.035781e-08
## 13 12 6.090306e-09
## 14 13 3.956429e-10
## 15 14 2.203032e-11
## 16 15 1.049510e-12
## 17 16 4.261188e-14
## 18 17 1.465509e-15
## 19 18 4.231266e-17
## 20 19 1.012696e-18
## 21 20 1.973624e-20
## 22 21 3.052667e-22
## 23 22 3.605629e-24
## 24 23 3.055193e-26
## 25 24 1.653947e-28
## 26 25 4.297797e-31
barplot(probx,names.arg=xsuccesses,xlab="# successes",ylab="probability")
# Use these probabilities to calculate the P-value corresponding to
# an observed 10 spermatogenesis genes on the X chromosome.
# Remember to multiply the probability of 10 or more successes
# by 2 for the two-tailed test result.
2 * sum(probx[xsuccesses >= 10])
## [1] 1.987976e-06
# For a faster result, try R's built-in binomial test.
# The resulting P-value is slightly different from our calculation.
# The output of binom.test includes a confidence interval for the
# proportion using the Clopper-Pearson method, which is more
# conservative than the Agresti-Coull method.
binom.test(10, n = 25, p = 0.061)
##
## Exact binomial test
##
## data: 10 and 25
## number of successes = 10, number of trials = 25, p-value = 9.94e-07
## alternative hypothesis: true probability of success is not equal to 0.061
## 95 percent confidence interval:
## 0.2112548 0.6133465
## sample estimates:
## probability of success
## 0.4
# Agresti-Coull 95% confidence interval for the proportion
# using the binom package.
library(binom)
binom.confint(30, n = 87, method = "ac")
## method x n mean lower upper
## 1 agresti-coull 30 87 0.3448276 0.2532164 0.4495625
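To see why Clopper-Pearson is called more conservative, we can compute the Agresti-Coull interval by hand in base R for the moth counts (10 of 25) and compare the interval widths. This is a sketch of the textbook formula, using z = 1.96 for 95% confidence:

```r
x <- 10; n <- 25; z <- 1.96
# Agresti-Coull: add z^2/2 successes and z^2/2 failures,
# then apply the usual normal approximation
p_tilde <- (x + z^2 / 2) / (n + z^2)
se <- sqrt(p_tilde * (1 - p_tilde) / (n + z^2))
ac <- c(p_tilde - z * se, p_tilde + z * se)
cp <- binom.test(x, n)$conf.int  # Clopper-Pearson (exact)
diff(ac)  # Agresti-Coull width
diff(cp)  # Clopper-Pearson width is larger (more conservative)
```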
We perform a goodness-of-fit test. We use a Poisson probability model and fit it to frequency data on the number of marine invertebrate extinctions per time block in the fossil record. We first read and inspect the data. Each row is a time block, with the observed number of extinctions listed.
The first step is to create a frequency table for the number of time blocks in each number-of-extinctions category. The command table does what’s needed, but note that some extinction categories are not represented (e.g., 0, 12 and 13 extinctions).
extinctTable <- table(extinctData$numberOfExtinctions)
data.frame(Frequency = addmargins(extinctTable))
## Frequency.Var1 Frequency.Freq
## 1 1 13
## 2 2 15
## 3 3 16
## 4 4 7
## 5 5 10
## 6 6 4
## 7 7 2
## 8 8 1
## 9 9 2
## 10 10 1
## 11 11 1
## 12 14 1
## 13 16 2
## 14 20 1
## 15 Sum 76
To remedy the problem of missing categories, we transform the original variable into a factor (that is, a categorical variable) with all counts between 0 and 20 as levels (note that 20 is not the maximum possible number of extinctions, but it is a convenient cutoff for this table).
extinctData = mutate(extinctData,nExtinctFactor = factor(numberOfExtinctions, levels = c(0:20)))
extinctTable2 <- table(extinctData$nExtinctFactor)
Estimate the mean number of extinctions per time block from the data. The estimate is needed for the goodness-of-fit test.
meanExtinctions <- mean(extinctData$numberOfExtinctions)
meanExtinctions
## [1] 4.210526
Calculate expected frequencies under the Poisson distribution using the estimated mean. (For now, continue to use 20 extinctions as the cutoff, but don’t forget that the Poisson distribution includes the number-of-extinctions categories 21, 22, 23, and so on.)
expectedProportion <- dpois(0:20, lambda = meanExtinctions)
expectedFrequency <- expectedProportion * 76
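As an aside, you can check in R that the probability mass beyond the cutoff of 20 is negligible for this estimated mean (the mean is hard-coded here for illustration):

```r
meanExtinctions <- 4.210526  # estimated mean number of extinctions
# ppois gives the Poisson CDF, so 1 - ppois(20, lambda) is
# the probability of 21 or more extinctions
1 - ppois(20, lambda = meanExtinctions)
# this is tiny, so cutting the table at 20 loses almost nothing
```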
Show the frequency distribution in a histogram.
ggplot(extinctData) +
geom_histogram(mapping= aes(x =numberOfExtinctions),fill = "firebrick",color="black",bins=21)+
labs (x="Number of extinctions")
Make a table of observed and expected frequencies, saving as a data frame.
extinctFreq <- data.frame(nExtinct = 0:20, obsFreq = as.vector(extinctTable2),
expFreq = expectedFrequency)
extinctFreq
## nExtinct obsFreq expFreq
## 1 0 0 1.127730e+00
## 2 1 13 4.748338e+00
## 3 2 15 9.996501e+00
## 4 3 16 1.403018e+01
## 5 4 7 1.476861e+01
## 6 5 10 1.243672e+01
## 7 6 4 8.727524e+00
## 8 7 2 5.249639e+00
## 9 8 1 2.762968e+00
## 10 9 2 1.292616e+00
## 11 10 1 5.442596e-01
## 12 11 1 2.083290e-01
## 13 12 0 7.309790e-02
## 14 13 0 2.367543e-02
## 15 14 1 7.120431e-03
## 16 15 0 1.998718e-03
## 17 16 2 5.259783e-04
## 18 17 0 1.302733e-04
## 19 18 0 3.047328e-05
## 20 19 0 6.753081e-06
## 21 20 1 1.421701e-06
The low expected frequencies will violate the assumptions of the \(\chi^2\) test, so we will need to group categories. Create a new variable that groups the extinctions into fewer categories (again, there are many ways to do this, here I use the function cut).
extinctFreq$groups <- cut(extinctFreq$nExtinct,
breaks = c(0, 2:8, 21), right = FALSE,
labels = c("0 or 1","2","3","4","5","6","7","8 or more"))
extinctFreq
## nExtinct obsFreq expFreq groups
## 1 0 0 1.127730e+00 0 or 1
## 2 1 13 4.748338e+00 0 or 1
## 3 2 15 9.996501e+00 2
## 4 3 16 1.403018e+01 3
## 5 4 7 1.476861e+01 4
## 6 5 10 1.243672e+01 5
## 7 6 4 8.727524e+00 6
## 8 7 2 5.249639e+00 7
## 9 8 1 2.762968e+00 8 or more
## 10 9 2 1.292616e+00 8 or more
## 11 10 1 5.442596e-01 8 or more
## 12 11 1 2.083290e-01 8 or more
## 13 12 0 7.309790e-02 8 or more
## 14 13 0 2.367543e-02 8 or more
## 15 14 1 7.120431e-03 8 or more
## 16 15 0 1.998718e-03 8 or more
## 17 16 2 5.259783e-04 8 or more
## 18 17 0 1.302733e-04 8 or more
## 19 18 0 3.047328e-05 8 or more
## 20 19 0 6.753081e-06 8 or more
## 21 20 1 1.421701e-06 8 or more
Then sum up the observed and expected frequencies within the new categories.
obsFreqGroup <- tapply(extinctFreq$obsFreq, extinctFreq$groups, sum)
expFreqGroup <- tapply(extinctFreq$expFreq, extinctFreq$groups, sum)
ObsVsExp = data.frame(obs = obsFreqGroup, exp = expFreqGroup,group= c("0 or 1","2","3","4","5","6","7","8 or more"))
ObsVsExp
## obs exp group
## 0 or 1 13 5.876068 0 or 1
## 2 15 9.996501 2
## 3 16 14.030177 3
## 4 7 14.768608 4
## 5 10 12.436722 5
## 6 4 8.727524 6
## 7 2 5.249639 7
## 8 or more 9 4.914760 8 or more
The expected frequency for the last category, “8 or more”, doesn’t yet include the expected frequencies for the categories 21, 22, 23, and so on (remember that we used a cut-off of 20 before!). However, the expected frequencies must sum to 76. In the following, we recalculate the expected frequency for the last group, expFreqGroup[length(expFreqGroup)], as 76 minus the sum of the expected frequencies for all the other groups.
expFreqGroup[length(expFreqGroup)] = 76 - sum(expFreqGroup[1:(length(expFreqGroup)-1)])
data.frame(obs = obsFreqGroup, exp = expFreqGroup)
## obs exp
## 0 or 1 13 5.876068
## 2 15 9.996501
## 3 16 14.030177
## 4 7 14.768608
## 5 10 12.436722
## 6 4 8.727524
## 7 2 5.249639
## 8 or more 9 4.914761
Finally, we are ready to carry out the \(\chi^2\) goodness-of-fit test. R gives us a warning here because one of the expected frequencies is less than 5. However, we have been careful to meet the assumptions of the \(\chi^2\) test, so let’s persevere. Once again, R doesn’t know that we have estimated a parameter from the data (the mean), so it won’t use the correct degrees of freedom when calculating the P-value. As before, we need to grab the \(\chi^2\) value calculated by chisq.test and recalculate \(P\) using the correct degrees of freedom. Since the number of categories is now 8, the correct degrees of freedom is 8 - 1 - 1 = 6.
saveChiTest <- chisq.test(obsFreqGroup, p = expFreqGroup/76)
## Warning in chisq.test(obsFreqGroup, p = expFreqGroup/76): Chi-squared
## approximation may be incorrect
saveChiTest
##
## Chi-squared test for given probabilities
##
## data: obsFreqGroup
## X-squared = 23.95, df = 7, p-value = 0.001163
# Wrong degrees of freedom, so wrong P-value!
pValue <- 1 - pchisq(saveChiTest$statistic, df = 6)
# correct P-value!
pValue
## X-squared
## 0.0005334919
Finally, we can compare the expected and observed data in a histogram. One can clearly see that the fit is not very good.
df = data.frame(x = 0:20,y = 76*dpois(0:20,meanExtinctions))
ggplot(extinctData) +
geom_histogram(mapping= aes(x =numberOfExtinctions),fill = "firebrick",color="black",bins=21)+
labs (x="Number of extinctions") +
geom_line(data = df, mapping = aes(x = x,y=y))
We use a one-sample t-test to compare body temperature in a random sample of people with the “expected” temperature 98.6 F.
Read and inspect the data:
heat <- read.csv(url("http://www.zoology.ubc.ca/~schluter/WhitlockSchluter/wp-content/data/chapter11/chap11e3Temperature.csv"))
head(heat)
## individual temperature
## 1 1 98.4
## 2 2 98.6
## 3 3 97.8
## 4 4 98.8
## 5 5 97.9
## 6 6 99.0
ggplot(data = heat) +
geom_histogram(mapping = aes(x = temperature),fill = "firebrick",color="black",bins=15) +
labs(x = "Body temperature (degrees F)",y = "Frequency", main = "")
A one-sample t-test can be calculated using the function t.test. The mu argument gives the value stated in the null hypothesis. Let us first transform the data into Celsius using the conversion formula: \[(X\ ^{\circ}\mathrm{F} - 32) \cdot 5/9 = Y\ ^{\circ}\mathrm{C}\]
mu0 = (98.6-32)*5/9
heat = mutate(heat,temperatureC = (temperature-32) * 5/9)
ggplot(data = heat) +
geom_histogram(mapping = aes(x = temperatureC),fill = "firebrick",color="black",bins=15) +
labs(x = "Body temperature (degrees C)",y = "Frequency", main = "")
t.test(heat$temperatureC, mu = mu0)
##
## One Sample t-test
##
## data: heat$temperatureC
## t = -0.56065, df = 24, p-value = 0.5802
## alternative hypothesis: true mean is not equal to 37
## 95 percent confidence interval:
## 36.80235 37.11321
## sample estimates:
## mean of x
## 36.95778
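Under the hood, t.test computes \(t = (\bar{x} - \mu_0)/(s/\sqrt{n})\). A minimal sketch with made-up numbers (hypothetical values, not the temperature data) confirming that the manual calculation matches the built-in function:

```r
x <- c(36.8, 37.1, 36.6, 37.0, 36.9, 36.7)  # hypothetical measurements
mu0 <- 37
# manual t statistic: (sample mean - mu0) / standard error
t_manual <- (mean(x) - mu0) / (sd(x) / sqrt(length(x)))
t_builtin <- t.test(x, mu = mu0)$statistic
t_manual
t_builtin  # the two values agree
```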
We compare horn length of live and dead (spiked) horned lizards.
lizard <- read.csv(url("http://www.zoology.ubc.ca/~schluter/WhitlockSchluter/wp-content/data/chapter12/chap12e3HornedLizards.csv"))
head(lizard)
## squamosalHornLength Survival
## 1 25.2 living
## 2 26.9 living
## 3 26.6 living
## 4 25.6 living
## 5 25.7 living
## 6 25.9 living
Note that there is one missing value for the variable “squamosalHornLength”. Everything is easier if we eliminate the row with missing data.
lizard2 <- na.omit(lizard)
head(lizard2)
## squamosalHornLength Survival
## 1 25.2 living
## 2 26.9 living
## 3 26.6 living
## 4 25.6 living
## 5 25.7 living
## 6 25.9 living
ggplot(data = lizard2) +
geom_histogram(mapping = aes(x = squamosalHornLength ),fill = "firebrick",color="black") +
facet_wrap( ~Survival,nrow=2) +
theme_classic()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
A two-sample t-test of the difference between two means can be carried out with t.test by using a formula, asking if squamosalHornLength is predicted by Survival, and specifying that the variables are in the data frame lizard.
t.test(squamosalHornLength ~ Survival, data = lizard)
##
## Welch Two Sample t-test
##
## data: squamosalHornLength by Survival
## t = -4.2634, df = 40.372, p-value = 0.0001178
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -3.381912 -1.207092
## sample estimates:
## mean in group killed mean in group living
## 21.98667 24.28117
The same result could be achieved with
t.test(lizard$squamosalHornLength[lizard$Survival=="killed"], lizard$squamosalHornLength[lizard$Survival=="living"])
##
## Welch Two Sample t-test
##
## data: lizard$squamosalHornLength[lizard$Survival == "killed"] and lizard$squamosalHornLength[lizard$Survival == "living"]
## t = -4.2634, df = 40.372, p-value = 0.0001178
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -3.381912 -1.207092
## sample estimates:
## mean of x mean of y
## 21.98667 24.28117
The output of t.test includes the 95% confidence interval for the difference between means. Add $conf.int after calling the function to get R to report only the confidence interval. The formula in the following command tells R to compare squamosalHornLength between the two groups indicated by Survival.
t.test(squamosalHornLength ~ Survival, data = lizard)$conf.int
## [1] -3.381912 -1.207092
## attr(,"conf.level")
## [1] 0.95
Note: R has used the Welch two-sample t-test here. If we want to force R to use the standard t-test (which assumes equal variances), we set a parameter specifying this:
t.test(squamosalHornLength ~ Survival, data = lizard,var.equal = TRUE)
##
## Two Sample t-test
##
## data: squamosalHornLength by Survival
## t = -4.3494, df = 182, p-value = 2.27e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -3.335402 -1.253602
## sample estimates:
## mean in group killed mean in group living
## 21.98667 24.28117
We investigate the ratio of biomass between marine reserves and non-reserve control areas.
marine <- read.csv(url("http://www.zoology.ubc.ca/~schluter/WhitlockSchluter/wp-content/data/chapter13/chap13e1MarineReserve.csv"))
head(marine)
## biomassRatio
## 1 1.34
## 2 1.96
## 3 2.49
## 4 1.27
## 5 1.19
## 6 1.15
ggplot(data = marine) +
geom_histogram(mapping = aes(x = biomassRatio))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Normal quantile plot of biomass ratio data.
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
qqPlot(marine$biomassRatio, distribution = "norm")
## [1] 19 20
The function log() takes the natural logarithm of all the elements of a vector or variable. The following command puts the results into a new variable in the same data frame, marine.
marine = mutate(marine,logBiomassRatio = log(biomassRatio))
Histogram and QQ-Plot of the log-transformed marine biomass ratio.
ggplot(data = marine) +
geom_histogram(mapping = aes(x = logBiomassRatio),col="black",fill="darkred")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
qqPlot(marine$logBiomassRatio, distribution = "norm")
## [1] 19 20
95% confidence interval of the mean using the log-transformed data.
t.test(marine$logBiomassRatio)$conf.int
## [1] 0.3470180 0.6112365
## attr(,"conf.level")
## [1] 0.95
Back-transform lower and upper limits of confidence interval (exp is the inverse of the log function).
conf.int = exp(t.test(marine$logBiomassRatio)$conf.int )
meanBiomassRatio = mean(marine$biomassRatio)
ggplot(data = marine) +
geom_histogram(mapping= aes(x = biomassRatio),col="black",fill="darkred") +
geom_vline(xintercept = conf.int)+
geom_vline(xintercept = meanBiomassRatio,lwd=2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We use the Wilcoxon rank-sum test (equivalent to the Mann-Whitney U-test) comparing times to mating (in hours) of starved and fed female sagebrush crickets. We also apply the permutation test to the same data.
Read and inspect data:
cannibalism <- read.csv(url("http://www.zoology.ubc.ca/~schluter/WhitlockSchluter/wp-content/data/chapter13/chap13e5SagebrushCrickets.csv"))
head(cannibalism)
## feedingStatus timeToMating
## 1 starved 1.9
## 2 starved 2.1
## 3 starved 3.8
## 4 starved 9.0
## 5 starved 9.6
## 6 starved 13.0
ggplot(data = cannibalism) +
geom_histogram(mapping = aes(x = timeToMating ),bins=12,color="black",fill="darkred") +
facet_wrap( ~feedingStatus,nrow=2) +
theme_classic()
We perform the Wilcoxon rank-sum test (equivalent to the Mann-Whitney U-test):
wilcox.test(timeToMating ~ feedingStatus, data = cannibalism)
##
## Wilcoxon rank sum test
##
## data: timeToMating by feedingStatus
## W = 88, p-value = 0.3607
## alternative hypothesis: true location shift is not equal to 0
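The statistic W reported by wilcox.test is the Mann-Whitney U for the first group: the number of pairs in which that group's value exceeds the other group's value, or equivalently its rank sum in the pooled sample minus the minimum possible rank sum. A sketch with toy data (no ties):

```r
a <- c(1, 3, 5)  # toy data
b <- c(2, 4, 6)
# rank-sum of group a in the pooled sample, minus the minimum
# possible rank sum n_a * (n_a + 1) / 2
W_manual <- sum(rank(c(a, b))[seq_along(a)]) - length(a) * (length(a) + 1) / 2
W_builtin <- wilcox.test(a, b)$statistic
W_manual   # 3
W_builtin  # matches
```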
We use a permutation test of the difference between mean time to mating of starved and fed crickets. We begin by calculating the observed difference between means (starved minus fed). The difference is -18.25734 in this data set.
# tapply is another way to calculate the means of time to mating
# for each feeding status separately
cricketMeans <- tapply(cannibalism$timeToMating, cannibalism$feedingStatus, mean)
cricketMeans
## fed starved
## 35.98462 17.72727
diffMeans <- cricketMeans[2] - cricketMeans[1]
diffMeans
## starved
## -18.25734
We choose to perform 10,000 permutations and set a seed to make the analysis reproducible:
nPerm <- 10000
set.seed(2048)
The following code implements a loop that permutes the data many times (determined by nPerm; investigate the function sample to see what it does). In the loop, i is just a counter that goes from 1 to nPerm by 1; each permuted difference is saved in the vector permResult.
permResult <- vector() # initializes
for(i in 1:nPerm){
# step 1: permute the times to mating
permSample <- sample(cannibalism$timeToMating, replace = FALSE)
# step 2: calculate difference betweeen means
permMeans <- tapply(permSample, cannibalism$feedingStatus, mean)
permResult[i] <- permMeans[2] - permMeans[1]
}
Plot null distribution based on the permuted differences.
hist(permResult, right = FALSE, breaks = 100,col="darkred")
Use the null distribution to calculate an approximate P-value. This is twice the proportion of the permuted means that fall below the observed difference in means, diffMeans (-18.25734 in this example). The following code calculates the number of permuted means falling below diffMeans.
sum(as.numeric(permResult <= diffMeans))
## [1] 657
These commands obtain the fraction of permuted means falling below diffMeans.
sum(as.numeric(permResult <= diffMeans)) / nPerm
## [1] 0.0657
Finally, multiply by 2 to get the P-value for a two-sided test.
2 * ( sum(as.numeric(permResult <= diffMeans)) / nPerm )
## [1] 0.1314
We can illustrate the permuted differences and add the observed difference to show the outcome of our test: the bright red bars indicate the proportion of permutation results that were more extreme than the one observed.
hist(permResult[permResult>diffMeans],breaks=seq(-50,50,by=1),col="darkred",main="")
hist(permResult[permResult<=diffMeans],breaks=seq(-50,50,by=1),add=T,col="red")
abline(v = diffMeans,col="red",lwd=2)
We compare the phase shift in the circadian rhythm of melatonin production in participants given alternative light treatments.
Read and inspect the data.
circadian <- read.csv(url("http://www.zoology.ubc.ca/~schluter/WhitlockSchluter/wp-content/data/chapter15/chap15e1KneesWhoSayNight.csv"))
Set the preferred ordering of groups in tables and graphs.
circadian$treatment <- factor(circadian$treatment,
levels = c("control", "knee", "eyes"))
meanShift <- tapply(circadian$shift, circadian$treatment, mean)
sdevShift <- tapply(circadian$shift, circadian$treatment, sd)
n <- tapply(circadian$shift, circadian$treatment, length)
mean_SD =data.frame(mean = meanShift, std.dev = sdevShift, n = n,treatment = c("control", "knee", "eyes"))
We plot the data showing the raw data points as well as the mean (including error bars showing one standard error in each direction).
ggplot(data = circadian) +
geom_jitter(mapping = aes(x = treatment,y=shift),width=0.1) +
geom_errorbar(data = mean_SD,mapping = aes(x = treatment,ymin=meanShift-sdevShift/sqrt(n), ymax=meanShift+sdevShift/sqrt(n)), width=.1,color="red") +
geom_point(data = mean_SD,mapping = aes(x = treatment,y=meanShift),color="red")
The first step involves fitting the ANOVA model to the data using lm (“lm” stands for “linear model”, of which ANOVA is one type). Then we use the command anova to assemble the ANOVA table.
circadianAnova <- lm(shift ~ treatment, data = circadian)
anova(circadianAnova)
## Analysis of Variance Table
##
## Response: shift
## Df Sum Sq Mean Sq F value Pr(>F)
## treatment 2 7.2245 3.6122 7.2894 0.004472 **
## Residuals 19 9.4153 0.4955
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We calculate R2, the fraction of variation in the response variable “explained” by treatment. This is again done in two steps. The first step calculates a number of useful quantities from the ANOVA model object previously created with the lm command. The second step shows the R2 value.
circadianAnovaSummary <- summary(circadianAnova)
circadianAnovaSummary$r.squared
## [1] 0.4341684
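You can verify this value directly from the ANOVA table above: R2 is the treatment sum of squares divided by the total sum of squares. A quick check with the numbers from the table:

```r
ss_treatment <- 7.2245  # Sum Sq for treatment (from the ANOVA table)
ss_residual  <- 9.4153  # Sum Sq for residuals
r_squared <- ss_treatment / (ss_treatment + ss_residual)
r_squared  # approximately 0.434, matching summary()$r.squared
```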
Analyze data from a factorial experiment investigating the effects of herbivore presence, height above low tide, and the interaction between these factors, on abundance of a red intertidal alga using two-factor ANOVA.
Read and inspect the data.
algae <- read.csv(url("http://www.zoology.ubc.ca/~schluter/WhitlockSchluter/wp-content/data/chapter18/chap18e3IntertidalAlgae.csv"))
head(algae)
## height herbivores sqrtArea
## 1 low minus 9.405573
## 2 low minus 34.467736
## 3 low minus 46.673485
## 4 low minus 16.642139
## 5 low minus 24.377498
## 6 low minus 38.350604
We first take a look at the data:
ggplot(data = algae, aes(x = height, color = herbivores, group = herbivores, y = sqrtArea)) +
geom_point()
This is a mess so we next add some summary statistics of the data: the mean Area for each combination of explanatory variables.
ggplot(data = algae) +
aes(x = height, color = herbivores, group = herbivores, y = sqrtArea) +
geom_point()+
stat_summary(fun = mean, geom = "point") +
stat_summary(fun = mean, geom = "line")+
labs(title="Hopp YB! ;-)",
x ="height", y = "Area")
It looks like there is some interaction between height and herbivores, but there is also quite some noise so it is difficult to tell.
Let us first fit a null model having both main effects but no interaction term. We can do this again with lm (note: conceptually, a t-test, an ANOVA and a linear regression are all pretty much the same thing, and hence the function lm can be used in all these cases):
algaeNoInteractModel <- lm(sqrtArea ~ height + herbivores, data = algae)
summary(algaeNoInteractModel)
##
## Call:
## lm(formula = sqrtArea ~ height + herbivores, data = algae)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28.171 -13.748 -2.235 15.433 34.576
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26.520 3.602 7.362 5.53e-10 ***
## heightmid 2.358 4.160 0.567 0.5729
## herbivoresplus -9.722 4.160 -2.337 0.0227 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.64 on 61 degrees of freedom
## Multiple R-squared: 0.0866, Adjusted R-squared: 0.05665
## F-statistic: 2.892 on 2 and 61 DF, p-value: 0.06311
Now fit the full model, with interaction term included.
algaeFullModel <- lm(sqrtArea ~ height * herbivores, data = algae)
Finally we create an ANOVA table that compares the two models:
anova(algaeNoInteractModel, algaeFullModel)
## Analysis of Variance Table
##
## Model 1: sqrtArea ~ height + herbivores
## Model 2: sqrtArea ~ height * herbivores
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 61 16888
## 2 60 14270 1 2617 11.003 0.001549 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
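To illustrate the earlier remark that a t-test is just a special case of a linear model, here is a sketch with simulated (made-up) data showing that lm and the equal-variance t.test give identical p-values for a two-group comparison:

```r
set.seed(7)  # for reproducibility
group <- factor(rep(c("A", "B"), each = 20))
y <- rnorm(40, mean = ifelse(group == "A", 10, 12), sd = 2)
# equal-variance two-sample t-test
p_ttest <- t.test(y ~ group, var.equal = TRUE)$p.value
# the same comparison expressed as a linear model
p_lm <- summary(lm(y ~ group))$coefficients["groupB", "Pr(>|t|)"]
c(p_ttest, p_lm)  # identical
```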
We test the null hypothesis of zero regression slope. The data are from an experiment investigating the effect of plant species diversity on the stability of plant biomass production.
Read and inspect the data.
prairie <- read.csv(url("http://www.zoology.ubc.ca/~schluter/WhitlockSchluter/wp-content/data/chapter17/chap17e3PlantDiversityAndStability.csv"))
head(prairie)
## nSpecies biomassStability
## 1 1 7.47
## 2 1 6.74
## 3 1 6.61
## 4 1 6.40
## 5 1 5.67
## 6 1 5.26
We perform a log-transform on the variable stability and include the new variable in the data frame. Inspect the data frame once again to make sure the function worked as intended.
prairie = mutate(prairie,logStability = log(biomassStability))
head(prairie)
## nSpecies biomassStability logStability
## 1 1 7.47 2.010895
## 2 1 6.74 1.908060
## 3 1 6.61 1.888584
## 4 1 6.40 1.856298
## 5 1 5.67 1.735189
## 6 1 5.26 1.660131
Scatter plot with regression line in base R:
plot(logStability ~ nSpecies, data = prairie)
prairieRegression <- lm(logStability ~ nSpecies, data = prairie)
abline(prairieRegression)
Scatter plot with regression line in ggplot:
ggplot(data = prairie,aes(y=logStability,x=nSpecies)) +
geom_point()+
geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
prairieRegression <- lm(logStability ~ nSpecies, data = prairie)
summary(prairieRegression)
##
## Call:
## lm(formula = logStability ~ nSpecies, data = prairie)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.97148 -0.25984 -0.00234 0.23100 1.03237
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.198294 0.041298 29.016 < 2e-16 ***
## nSpecies 0.032926 0.004884 6.742 2.73e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3484 on 159 degrees of freedom
## Multiple R-squared: 0.2223, Adjusted R-squared: 0.2174
## F-statistic: 45.45 on 1 and 159 DF, p-value: 2.733e-10
confint(prairieRegression)
## 2.5 % 97.5 %
## (Intercept) 1.11673087 1.27985782
## nSpecies 0.02328063 0.04257117
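Because the response is on a log scale, the slope has a multiplicative interpretation: each additional species multiplies stability by exp(slope). A quick back-transformation of the estimate and its confidence limits (values copied from the output above):

```r
slope <- 0.032926                # slope estimate from the regression output
ci <- c(0.02328063, 0.04257117)  # 95% CI from confint()
exp(slope)  # about 1.033: each added species raises stability by roughly 3.3%
exp(ci)     # confidence interval for the multiplicative effect
```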
Finally, we check our modelling assumptions with a QQ plot of the residuals as well as a Tukey-Anscombe plot:
qqPlot(prairieRegression$residuals,dist="norm")
## [1] 161 66
plot(prairieRegression$fitted.values,prairieRegression$residuals,xlab="fitted",ylab="Residuals")
Assume that \({B_{i}, i=1,2,3,\ldots }\) is a set of pairwise disjoint events whose union is the entire sample space (that is, every possible event that can happen is covered by exactly one of the \(B_i\)s).
As an example: consider rolling a six-sided die. Then the event \(B_i (i = 1, ..., 6)\) is the event that you roll the number \(i\). All events are disjoint and together they cover everything that can happen.
For any event \(A\) we can then write
\[P(A) = \sum _{i}P(A \cap B_{i})\]
or, equivalently,
\[P(A) = \sum_{i} P(A \mid B_{i})P(B_{i}).\]
This figure is helpful for understanding what is going on here:
Illustration of the law of total probability
Example:
We roll two dice. We want to know the probability that the first roll has a smaller outcome than the second roll.
We call \(B_i, i = 1, ..., 6\) the event that the first roll shows \(i\) dots. So \(B_2\) is the event that you roll a 2 with the first roll. (Think about why this satisfies the conditions for the events \(B_i\) stated in the law of total probability.)
Let us call \(X_1\) the outcome of the first roll and \(X_2\) the outcome of the second roll: \(P(B_2)\) is then \(P(X_1 = 2)\). The event \(A\) is the event that \(X_1 < X_2\). We use the formula
\[P(A) = \sum_{i} P(A \mid B_{i})P(B_{i}).\]
If you roll an \(i\) with the first die, there are \(6-i\) possible outcomes for the second roll that are larger than \(i\). This means: \(P(A \mid B_{1}) = 5/6\), \(P(A \mid B_{2}) = 4/6\), \(P(A \mid B_{3}) = 3/6\), \(P(A \mid B_{4}) = 2/6\), \(P(A \mid B_{5}) = 1/6\), \(P(A \mid B_{6}) = 0\),
and furthermore
\(P(B_i) = 1/6\) for each \(i\).
Then
\[P(A) = \tfrac{5}{6}\cdot\tfrac{1}{6} + \tfrac{4}{6}\cdot\tfrac{1}{6} + \tfrac{3}{6}\cdot\tfrac{1}{6} + \tfrac{2}{6}\cdot\tfrac{1}{6} + \tfrac{1}{6}\cdot\tfrac{1}{6} + 0\cdot\tfrac{1}{6} = \tfrac{15}{36} \approx 0.417.\]
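We can also evaluate this sum directly in R:

```r
# P(A | B_i) = (6 - i) / 6 and P(B_i) = 1/6 for i = 1, ..., 6
i <- 1:6
p_A <- sum((6 - i) / 6 * (1 / 6))
p_A  # 0.4166667, i.e. 15/36
```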
Let us test this with a small R simulation:
# We do 10000 die rolls for each die:
die.1 = sample.int(6, size = 10000,replace=T)
die.2 = sample.int(6, size = 10000,replace=T)
# calculate the proportion of times die 2 is larger than die 1
sum(die.2>die.1)/10000
## [1] 0.4129
The peppered moth (Biston betularia) occurs in two types: peppered (speckled black and white) and melanic (black). A researcher wished to measure the proportion of melanic individuals in the peppered moth population in England, to examine how this proportion changed from year to year in the past. To accomplish this, she photographed all the peppered moth specimens available in museums and large private collections and grouped them by the year in which they had been collected. Based on this sample, she calculated the proportion of melanic individuals in every year. The people who collected the specimens, she knew, would prefer to collect whichever type was rarest in any given year, since those would be the most valuable.
Can the specimens from any given year be considered a random sample from the moth population?
If not a random sample, what type of sample is it?
What type of error might be introduced by the sampling method when estimating the proportion of melanic moths?
Which of the following numerical variables are continuous? Which are discrete? (Note: Discrete variables take values from a countable set of distinct values, whereas continuous variables can take any value within a given interval.)
Number of injuries sustained in a fall
Fraction of birds in a large sample infected with avian flu virus
Number of crimes committed by a randomly sampled individual
Logarithm of body mass
For each of the following studies, say which is the explanatory variable and which is the response variable. Also, say whether the study is observational or experimental.
Forestry researchers wanted to compare the growth rates of trees growing at high altitude to that of trees growing at low altitude. They measured growth rates using the space between tree rings in a set of trees harvested from a natural forest.
Researchers randomly assign diabetes patients to two groups. In the first group, the patients receive a new drug, tasploglutide, whereas patients in the other group receive standard treatment without the new drug. The researchers compared the rate of insulin release in the two groups.
Psychologists tested whether the frequency of illegal drug use differs between people suffering from schizophrenia and those not having the disease. They measured drug use in a group of schizophrenia patients and compared it with that in a similar sized group of randomly chosen people.
Spinner Hansen et al. (2008) studied a species of spider whose females often eat males that are trying to mate with them. The researchers removed a leg from each male spider in one group (to make them weaker and more vulnerable to being eaten) and left the males in another group undamaged. They studied whether survival of males in the two groups differed during courtship.
Bowen et al. (2012) studied the effects of advanced communication therapy for patients whose communication skills had been affected by previous strokes. They randomly assigned two therapies to stroke patients. One group received advanced communication therapy and the other received only social visits without formal therapy. Both groups otherwise received normal, best-practice care. After six months, the communication ability (as measured by a standardized quantitative test score) was measured on all patients.
During World War II, the British Royal Air Force estimated the density of bullet holes on different sections of planes returning to base from aerial sorties. Their goal was to use this information to determine which plane sections most needed additional protective shields. (It was not possible to reinforce the whole plane, because it would weigh too much.) They found that the density of holes was highest on the wings and lowest on the engines and near the cockpit, where the pilot sits (their initial conclusion, that therefore the wings should be reinforced, was later shown to be mistaken). What is the main problem with the sample: bias or large sampling error? What part of the plane should have been reinforced?
Figure for exercise 4: Airplane bullet holes
An important quantity in conservation biology is the number of plant and animal species inhabiting a given area. To survey the community of small mammals inhabiting Kruger National Park in South Africa, a large series of live traps were placed randomly throughout the park for one week in the main dry season of 2004. Traps were set each evening and checked the following morning. Individuals caught were identified, tagged (so that new captures could be distinguished from recaptures), and released. At the end of the survey, the total number of small mammal species in the park was estimated by the total number of species captured in the survey.
What is the parameter being estimated in the survey?
Is the sample of individuals captured in the traps likely to be a random sample? Why or why not? In your answer, address the two criteria that define a sample as random.
Is the number of species in the sample likely to be an unbiased estimate of the total number of small mammal species in the park?
No. Collectors did not choose randomly, but preferred rarer types over more common ones.
It is a sample of convenience.
Thus, there is bias in every year’s sample because rare types are overrepresented. Furthermore, this bias might change across years as the frequencies of the two morphs have changed over time.
Discrete.
Technically it is a discrete variable (because fractions are restricted to finitely many values in a finite sample), but it makes sense to treat it as a continuous variable if the sample is very large.
Discrete. There are no half crimes.
Continuous. Body mass is continuous, and hence so is its logarithm.
E(xplanatory): Altitude (categorical: high vs low) R(esponse): Growth rate S(tudy type): Observational
E: Treatment (standard vs. tasploglutide) R: Rate of insulin release S: Experimental
E: Health status (schizophrenia vs. healthy) R: Frequency of illegal drug use S: Observational
E: Number of legs R: Survival probability S: Experimental
E: Treatment (advanced communication therapy vs. social visits without formal therapy) R: Communication ability S: Experimental
The main problem is a strong bias in the sample. To see this, consider the sample of planes that were used in this study. Only planes that were not hit in critical areas were available to estimate the distribution of bullet holes. Planes that were hit in a critical area, i.e., one that leads to a crash, were not available because they did not return to base. With this knowledge, it becomes clear that it would have been better to reinforce the areas where no, or very few, bullet holes were found, namely the cockpit and engine.
The parameter being estimated is the total number of small mammal species in Kruger National Park.
No, the sample is not likely to be random. First, in a random sample every individual has the same chance of being selected, but some small mammals might be easier to trap than others (for example, trapping only at night might miss all the mammals active only in the daytime). Second, in a random sample individuals are selected independently, but multiple animals caught in the same trap might not be independent if they are related or live near one another (this is harder to judge).
The number of species in the sample might underestimate the number in the park if sampling was not random (e.g., if daytime mammals were missed), or if rare species happened to avoid capture. Even if the sample is perfectly random, the estimate will fall short of the true number in most cases: you can miss species just by chance, but you can never capture more species than there actually are. Thus, on average, you will underestimate the true number.
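This downward bias can be illustrated with a small simulation. The community below (10 species with made-up, unequal abundances) is purely hypothetical, not the Kruger data: we repeatedly "survey" it and count how many of the true species appear in each sample.

```r
set.seed(42)
true.species <- 10
# hypothetical community: 10 species with made-up, unequal abundances
abund <- c(0.30, 0.20, 0.15, 0.10, 0.08, 0.07, 0.05, 0.03, 0.015, 0.005)
reps <- 1000
n.caught <- 50   # individuals captured per simulated survey
species.seen <- replicate(reps, {
  caught <- sample(1:true.species, n.caught, replace = TRUE, prob = abund)
  length(unique(caught))   # species observed in this survey
})
mean(species.seen)   # on average below the true value of 10
```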
We use the MPG data set introduced in the lecture. Answer the following questions. You can use base R or ggplot if not otherwise specified.
Run ggplot(data = mpg). What do you see?
How many rows are in mpg? How many columns?
What does the drv variable describe? Read the help for ?mpg to find out.
Make a scatterplot of hwy vs cyl.
What happens if you make a scatterplot of class vs drv? Why is the plot not useful?
What is wrong with this code? Why are the points not blue?
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))
ggplot(data = mpg)
+ geom_point(mapping = aes(x = displ, y = hwy))
You should see a blank plot. This is what ggplot() does: it simply creates a “canvas” to which you can add “layers”.
How many rows are in mpg? How many columns?
str(mpg)
## tibble [234 × 11] (S3: tbl_df/tbl/data.frame)
## $ manufacturer: chr [1:234] "audi" "audi" "audi" "audi" ...
## $ model : chr [1:234] "a4" "a4" "a4" "a4" ...
## $ displ : num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
## $ year : int [1:234] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
## $ cyl : int [1:234] 4 4 4 4 6 6 6 4 4 4 ...
## $ trans : chr [1:234] "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
## $ drv : chr [1:234] "f" "f" "f" "f" ...
## $ cty : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
## $ hwy : int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
## $ fl : chr [1:234] "p" "p" "p" "p" ...
## $ class : chr [1:234] "compact" "compact" "compact" "compact" ...
There are 234 observations of 11 variables. Thus, there are 11 columns and 234 rows. We can confirm this by looking at the dimensions of the underlying data frame:
dim(mpg)
## [1] 234 11
?mpg
drv is a categorical variable indicating the drive type: f = front-wheel drive, r = rear-wheel drive, 4 = four-wheel drive.
ggplot(data = mpg) +
geom_point(mapping = aes(y = hwy,x = cyl)) +
theme_classic() +
labs(title = "A scatterplot", y = "miles per gallon on highways",x="cylinders")
ggplot(data = mpg) +
geom_point(mapping = aes(x = class,y = drv)) +
theme_classic() +
labs(title = "A useless plot", x = "class",y="drive type")
Both drv and class are categorical variables and it does not make sense to visualize them this way. What would be a better way to show these data?
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy), color = "blue")
A continuous variable will lead to a continuous color gradient rather than a discrete set of colors.
ggplot(data = mpg) +
geom_point(mapping = aes(x = cty, y = hwy, color = displ))
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
Load the data set “chap04e1HumanGeneLengths.csv” (the length of genes in the human genome, found on ilias), answer the following questions, and provide the R code you used. Put all of your analysis in a single R script.
Load the data
genes <- read.csv(url("http://www.zoology.ubc.ca/~schluter/WhitlockSchluter/wp-content/data/chapter04/chap04e1HumanGeneLengths.csv"))
n = dim(genes)[1]
print(n)
## [1] 20290
sum.gene.lengths = sum(genes$geneLength)
mean.gene.length1 = sum.gene.lengths/n
mean.gene.length2 = mean(genes$geneLength)
print(mean.gene.length1)
## [1] 2622.027
print(mean.gene.length2)
## [1] 2622.027
square.sum.gene.lengths = sum(genes$geneLength^2)
print(square.sum.gene.lengths)
## [1] 223678143206
# by hand:
variance.gene.length = sum((genes$geneLength - mean.gene.length1)^2)/(n-1)
print(variance.gene.length)
## [1] 4149235
# for comparison:
var(genes$geneLength)
## [1] 4149235
sd(genes$geneLength)
## [1] 2036.967
# or
sd.gene.length = sqrt(variance.gene.length)
cv.gene.length = sd.gene.length/mean.gene.length1
print(cv.gene.length)
## [1] 0.7768672
# Which one do you like best?
ggplot(data = genes) +
geom_histogram(mapping = aes(x = geneLength),color="black",fill="darkred") +
theme_classic()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data = genes) +
geom_density(mapping = aes(x = geneLength),color="black",fill="darkred") +
theme_classic()
ggplot(data = genes) +
geom_density(mapping = aes(x = geneLength),color="black",fill="darkred") +
xlab("Gene length") +
ylab("Density") +
scale_x_log10() +
theme_classic()
ggplot(data = genes) +
geom_boxplot(mapping = aes(y = geneLength),color="black",fill="darkred") +
theme_classic() +
coord_flip()
Use the stickleback dataset and calculate the median and mean for each genotype. Make a histogram and add vertical lines to the histogram indicating the mean and median for each genotype. What do you see? How does the shape of the distribution affect differences between mean and median?
# load the data file:
stickleback <- read.csv(url("http://www.zoology.ubc.ca/~schluter/WhitlockSchluter/wp-content/data/chapter03/chap03e3SticklebackPlates.csv"))
# one possible solution:
mean.mm = mean(stickleback$plates[stickleback$genotype=="mm"])
mean.mM = mean(stickleback$plates[stickleback$genotype=="Mm"])
mean.MM = mean(stickleback$plates[stickleback$genotype=="MM"])
median.mm = median(stickleback$plates[stickleback$genotype=="mm"])
median.mM = median(stickleback$plates[stickleback$genotype=="Mm"])
median.MM = median(stickleback$plates[stickleback$genotype=="MM"])
par(mfrow = c(1,3))
hist(stickleback$plates[stickleback$genotype=="MM"],
main = "MM",xlab = "number of plates",
col="darkred")
abline(v = mean.MM,col="black")
abline(v = median.MM,col="blue")
hist(stickleback$plates[stickleback$genotype=="Mm"],
main = "Mm",xlab = "number of plates",
col="darkred")
abline(v = mean.mM,col="black")
abline(v = median.mM,col="blue")
hist(stickleback$plates[stickleback$genotype=="mm"],
main = "mm",xlab = "number of plates",
col="darkred")
abline(v = mean.mm,col="black")
abline(v = median.mm,col="blue")
# alternatively, we can first create a data frame
# with the means per genotype
# check out the help function of ddply
# (this is only one way to do this, you can also do it by
# "hand" or using other functions such as tapply) :
library(plyr)
df.mean = ddply(stickleback, "genotype", summarize, m.number = mean(plates))
df.median = ddply(stickleback, "genotype", summarize, m.number = median(plates))
# then we plot it with a single ggplot
ggplot() +
geom_histogram(data = stickleback, mapping = aes(x=plates),binwidth=5,color="black",fill="darkred") +
geom_vline(data = df.mean, aes(xintercept=m.number),col="black",size=1) +
geom_vline(data = df.median, aes(xintercept=m.number),col="blue",size = 1) +
facet_wrap(~ genotype) +
ylab("frequency") +
theme_classic()
Copy and paste this R code into a script. Before you run it, try to figure out what it does. Modify the output so that it looks nicer, i.e., add labels, change theme, change colors, …
Tip 1: Use the help function or the internet to figure out what the different commands and functions are doing.
Tip 2: This might also be helpful: https://s-peischl.github.io/StatisticsForBiology/#Working_with_Distributions_and_random_variables
Tip 3: Look at the next exercise, it may give you an idea what is happening in this R code …
reps = 1000
sample_size = 10
mu = 10
sigma = 5
means = vector("numeric",reps)
for (i in 1:reps)
{
sample = rnorm(sample_size,mu,sigma)
means[i] = mean(sample)
}
x = seq(mu-2*sigma,mu+2*sigma,by=0.01)
SE = sigma/sqrt(sample_size)
y = dnorm(x,mu,SE)
df2 = data.frame(x = x,y = y)
df = data.frame(means = means)
ggplot(data = df) +
geom_histogram(aes(x = means),binwidth=1) +
geom_line(data = df2,aes(x = x,y = y*reps))
This script simulates the sampling distribution of the mean.
Write a short R script that simulates the sampling distribution of the mean. Plot the sampling distribution and add the theoretical expectation to the plot. Follow these steps:
Tip if you get stuck: Have a look at the previous exercise for inspiration …
# We choose the Poisson distribution with mean lambda = 50
lambda = 50
# We do 1000 replicates
reps = 1000
# We choose the sample size n to be 10
sample_size = 10
# the means of each sample will be stored in this vector
means = vector("numeric",reps)
# in this loop, we calculate the mean of each sample
for (i in 1:reps)
{
# take a sample of the poisson distribution
sample = rpois(sample_size,lambda)
# store the mean in our vector
means[i] = mean(sample)
}
# a histogram - note that we set freq = FALSE to
# show relative frequencies
hist(means,freq=F,main="Sampling distribution of the mean")
# we specify the x values for the theoretical poisson
x = seq(0,3*lambda,by=0.1)
# we calculate the standard error:
SE = sqrt(lambda)/sqrt(sample_size)
# calculate the normal approximation to the sampling distribution (CLT)
y = dnorm(x,lambda,SE)
lines(x,y)
The pizza below, ordered from the Venn Pizzeria on Bayes Street, is divided into eight slices.
Figure: Venn Pizza
Answer the following questions based on the drawing of the pizza:
What is the probability that a randomly drawn slice has pepperoni on it?
What is the probability that a randomly drawn slice has both pepperoni and anchovies on it?
What is the probability that a randomly drawn slice has either pepperoni or anchovies on it?
Are pepperoni and anchovies mutually exclusive on this pizza?
Are olives and mushrooms mutually exclusive on this pizza?
Are getting mushrooms and getting anchovies independent when randomly picking a slice of pizza?
If I pick a slice from this pizza and tell you that it has olives on it, what is the chance that it also has anchovies?
If I pick a slice from this pizza and tell you that it has anchovies on it, what is the chance that it also has olives?
Seven of your friends each choose a slice at random and eat them without telling you what toppings they had. What is the chance that the last slice has olives on it?
You choose two slices at random. What is the chance that they have both olives on them? (Be careful: after removing the first slice, the probability of choosing one of the remaining slices changes.)
What is the probability that a randomly chosen slice does not have pepperoni on it?
Draw a pizza for which mushrooms, olives, anchovies and pepperoni are all mutually exclusive.
The pizza below, ordered from the Venn Pizzeria on Bayes Street, is divided into eight slices.
Figure: Venn Pizza
Answer the following questions based on the drawing of the pizza:
\[P(pepperoni) = 5/8 = 0.625 = 62.5 \%\]
\[P(pepperoni \quad and \quad anchovies) = 2/8 = 0.25 = 25 \%\]
\[P(pepperoni \quad or \quad anchovies) = 7/8 = 0.875 = 87.5\%\]
No. There are two slices with both pepperoni and anchovies.
Yes. There is no slice that has both olives and mushrooms on it.
No, because
\[P(anchovy) = 4/8 = 0.5\]
\[P(mushroom) = 3/8 = 0.375\]
and
\[P(mushroom \quad and \quad anchovy) = 1/8\]
but
\[P(mushroom)*P(anchovy) = 3/8 * 4/8 = 3/16 = 0.1875 = 18.75 \% .\]
\[P(anchovy \mid olive) =\] \[P(anchovy \quad and \quad olive)/P(olive) = (1/8) / (2/8) = 1/2 = 0.5 = 50\%.\]
\[P(olive \mid anchovy) =\] \[P(olive \quad and \quad anchovy)/P(anchovy) = (1/8) / (4/8) = 1/4 = 0.25 = 25\%.\]
\[P(last \quad slice \quad has \quad olives) = 2/8 = 0.25 = 25\%.\]
This can be seen directly because each slice has the same probability to be the last slice.
\[P(first \, slice \,has \, olives) = 2/8\]
\[P(second \, slice \, also \, has \, olives) = 1/7\]
\[P(both \, slices \, have \, olives) = 2/8 * 1/7 = 1/28 \approx 3.6\% \]
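Both the exact value and a simulation can be obtained in a few lines of base R (the logical vector below simply encodes which of the eight slices carry olives):

```r
set.seed(1)
olives <- c(TRUE, TRUE, rep(FALSE, 6))   # 2 of the 8 slices have olives
# exact: one olive pair out of all possible pairs of slices
p.exact <- choose(2, 2) / choose(8, 2)   # 1/28
# simulation: draw two slices without replacement, many times
p.sim <- mean(replicate(10000, all(sample(olives, 2))))
c(p.exact, p.sim)
```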
\[P(no \, pepperoni) = 1 - P(pepperoni) = 3/8 = 37.5\%\]
A pizza for which mushrooms, olives, anchovies and pepperoni are all mutually exclusive
After graduating from your university with a biology degree, you are interviewed for a lucrative job as a snake handler in a circus sideshow. As part of your audition, you must pick up two rattlesnakes from a pit. The pit contains eight snakes, three of which have been defanged and are assumed to be harmless, but the other five are definitely still dangerous. Unfortunately budget cuts have eliminated the herpetology course from the curriculum and so you have no way of telling in advance which snakes are dangerous and which are not. You pick one snake with your left hand and one with your right.
\[P(no \, dangerous \, snakes) = \] \[ P(no \, dangerous \, snake \, in \, left \, hand) P(no \, dangerous \, snake \, in \, right \, hand)\] \[= 3/8 * 2/7 = 6/56 = 0.107 = 10.7 \%\]
\[P(bite) = P(bite \mid 0 \, dangerous \, snakes) P(0 \, dangerous \, snakes)\] \[ + P(bite \mid 1 \, dangerous \, snake) P(1 \, dangerous \, snake) \] \[ + P(bite \mid 2 \, dangerous \, snakes) P(2 \, dangerous \, snakes) \]
Now we calculate the ingredients of this formula: \[P(0 \, dangerous \, snakes) = 0.107\] (from (a)) \[P(1 \, dangerous \, snake) = (5/8 * 3/7) + (3/8 * 5/7) = 0.536\] \[P(2 \, dangerous \, snakes) = 5/8 * 4/7 = 0.357\] \[P(bite \mid 0 \, dangerous \, snakes) = 0\] \[P(bite \mid 1 \, dangerous \, snake) = 0.8\] \[P(bite \mid 2 \, dangerous \, snakes) = 1- P(no \, bite \mid 2 \, dangerous \, snakes) = 1-(1- 0.8)^2 = 0.96\]
Putting all this together: \[P(bite) = (0*0.107)+(0.8*0.536)+(0.96*0.357) = 0.772\]
\[P(defanged \mid no \, bite)= \left[P(no \, bite \mid defanged)P(defanged)\right]/P(no \, bite)\] \[P(no \, bite \mid defanged)=1\] \[P(defanged) = 3/8 \]
\[ P(no \, bite) = P(defanged)P(no \, bite \mid defanged) \] \[ + P(dangerous)P(no \, bite \mid dangerous) = (3/8 \cdot 1) + (5/8 \cdot (1-0.8)) = 0.5\]
Therefore \[ P(defanged \mid no \, bite) = (1*3/8 )/ (0.5) = 0.75 = 75\%.\]
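As a sanity check, the whole scenario can be simulated (assuming, as in the calculation above, that each dangerous snake bites independently with probability 0.8):

```r
set.seed(7)
reps <- 50000
bites <- replicate(reps, {
  # pit with 5 dangerous and 3 defanged snakes; grab two at random
  snakes <- sample(c(rep("dangerous", 5), rep("defanged", 3)), 2)
  n.dangerous <- sum(snakes == "dangerous")
  # each dangerous snake bites independently with probability 0.8
  any(runif(n.dangerous) < 0.8)
})
p.bite <- mean(bites)
p.bite   # close to the calculated 0.772
```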
Five different researchers independently take a random sample from the same population and calculate a 95% confidence interval for the same parameter.
What is the probability that all five researchers have calculated an interval that includes the true value of the parameter?
What is the probability that at least one does not include the true parameter value.
\(0.95^5 = 0.774\)
\(1- 0.95^5 = 0.226\)
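The same calculation in R, using the multiplication rule for independent samples and the complement rule:

```r
# probability that all five independent 95% CIs cover the true value
p.all.cover <- 0.95^5
# complement: at least one interval misses
p.at.least.one.misses <- 1 - p.all.cover
round(c(p.all.cover, p.at.least.one.misses), 3)
```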
Use the stickleback data set. Calculate a confidence interval for the mean plate numbers for each genotype. Make a plot that shows the mean and the confidence interval of each genotype. Discuss which groups are different and which are not based on your analysis.
We first calculate the mean for each genotype:
stickleback <- read.csv(url("http://www.zoology.ubc.ca/~schluter/WhitlockSchluter/wp-content/data/chapter03/chap03e3SticklebackPlates.csv"))
mean.mm = mean(stickleback$plates[stickleback$genotype=="mm"])
mean.Mm = mean(stickleback$plates[stickleback$genotype=="Mm"])
mean.MM = mean(stickleback$plates[stickleback$genotype=="MM"])
Next the upper and lower limit of the 95% CI using a t-test:
t.t.mm = t.test(stickleback$plates[stickleback$genotype=="mm"])
t.t.Mm = t.test(stickleback$plates[stickleback$genotype=="Mm"])
t.t.MM = t.test(stickleback$plates[stickleback$genotype=="MM"])
CI.mm = t.t.mm$conf.int
CI.Mm = t.t.Mm$conf.int
CI.MM = t.t.MM$conf.int
We put it all in a dataframe:
df = data.frame(means = c(mean.mm,mean.Mm,mean.MM),
lower = c(CI.mm[1],CI.Mm[1],CI.MM[1]),
upper = c(CI.mm[2],CI.Mm[2],CI.MM[2]),
genotypes = c("mm","Mm","MM"))
Now we can plot everything:
ggplot(data = df) +
geom_errorbar(mapping = aes(x = genotypes,ymin=lower, ymax=upper), width=.1) +
geom_point(mapping = aes(x = genotypes,y=means))
If you prefer (I don’t), you can also do a bar plot:
ggplot(data = df) +
geom_col(mapping = aes(x = genotypes,y=means,fill=genotypes)) +
geom_errorbar(mapping = aes(x = genotypes,ymin=lower, ymax=upper), width=.1) +
geom_point(mapping = aes(x = genotypes,y=means))
Either way, it is advised to also add the raw data:
ggplot(data = df) +
geom_jitter(data = stickleback,mapping = aes(x = genotype,y = plates,color=genotype),size=0.8,width=0.05)+
geom_errorbar(mapping = aes(x = genotypes,ymin=lower, ymax=upper), width=.2) +
geom_point(mapping = aes(x = genotypes,y=means))
Make a plot of the probability density function of a Normal distribution. Choose values for the mean and variance. Next make a plot of the probability density function of the standard Normal distribution. What do you see? How are these plots different / similar?
Draw a random sample (sample size \(n = 1000\)) from the Normal distribution that you chose in (a). Plot a histogram of that sample.
Combine the two plots in a single plot, i.e., superimpose the probability density function on the histogram.
Calculate the 10 percent quantile of the distribution. Store the value in a variable \(Q10\). Then calculate the proportion of numbers in your random sample from (b) that are smaller or equal to \(Q10\). What do you find?
mu = 10
sigma = 2
Next we choose the values for the x axis. A good choice is to cover the range from \(\mu - 4 \sigma\) to \(\mu + 4 \sigma\). We choose a step size of 0.1:
x = seq(mu-4*sigma,mu+4*sigma,by=0.1)
Now we compute the corresponding \(y\) values for our plot using the function dnorm:
y = dnorm(x, mean = mu, sd = sigma)
I now do the same thing for the standard normal distribution:
mu.std = 0
sigma.std = 1
x.std = seq(mu.std-4*sigma.std,mu.std+4*sigma.std,by=0.1)
y.std = dnorm(x.std, mean = 0, sd = 1)
And finally we plot both density functions next to each other:
par(mfrow = c(1,2))
plot(x,y,type="l",xlab="x",ylab="density")
plot(x.std,y.std,type="l",xlab="x",ylab="density")
We see that both curves look exactly the same. The only difference is the scale of the x and y axes.
b and c)
random.sample = rnorm(1000,mu,sigma)
hist(random.sample,freq=F,ylim=c(0,0.2),breaks=30)
lines(x,y)
Q10 = qnorm(0.1,mean=mu,sd=sigma)
sum(random.sample<=Q10)/1000
## [1] 0.097
We find that approximately 10 percent of our random sample is smaller than \(Q10\). This is exactly the definition of a quantile!
Let \(X\) be a random variable that follows the standard normal distribution. Calculate the \(p = 0.05,0.1,0.15, ..., 0.95\) percentiles of the standard Normal distribution. Store the values in a vector \(Q\). Next calculate the cumulative distribution function for the values in \(Q\). Store this in a variable \(C\). Finally, plot \(p\) against \(Q\). What do you see? Can you explain?
Plot the probability density function of a Normal distribution. Indicate the smallest 5 percent of the distribution. In other words: Let \(X\) be a normally distributed random variable. For which range \((-\infty,c_1]\) do we get \(P(X \leq c_1) = 0.05\)?
Indicate the most extreme 5 percent of the distribution. In other words: For which range \((-\infty,-c_2] \cup [c_2,\infty)\) do we get \(P(X \leq -c_2 \text{ or } X > c_2 ) = 0.05\)?
[Hint: you can use the quantile function of \(R\) and the symmetry of the Normal distribution to simplify things]
p = seq(0.05,0.95,by=0.05)
Q = qnorm(p,mean=0,sd=1)
C = pnorm(Q,mean=0,sd=1)
plot(p,C)
We find that \(p = C\). This means that the cumulative distribution function is the inverse of the quantile function.
A mathematical explanation of this relationship:
The p-quantile \(q_p\) is defined as the number such that
\[P(X < q_p) = p.\] We define a quantile function \(Q(p) = q_p\). The cumulative distribution function \(F(x)\) gives us the value \[F(x) = P(X<x)\] Then \(F(q_p) = P(X < q_p) = p\)
and because
\[Q(p) = q_p,\] it follows that
\[F(Q(p)) = F(q_p) = p.\] Here is a graphical representation of this relationship:
cdf.std = pnorm(x.std,mean=0,sd=1)
plot(x.std,cdf.std,type="l",xlab="x",ylab = "P(X<x) ")
abline(v=qnorm(0.1,mean=0,sd=1),lwd=2,col="dodgerblue")
abline(h = 0.1,col="slategray",lwd=2)
text(-1.8,0.3,expression(Q[10]),col="dodgerblue")
text(-3.3,0.2,expression(P(X < Q[10])==0.1),col="slategray")
mu.std = 0
sigma.std = 1
x.std = seq(mu.std-4*sigma.std,mu.std+4*sigma.std,by=0.1)
y.std = dnorm(x.std, mean = 0, sd = 1)
plot(x.std,y.std,type="l")
c1 = qnorm(0.05,0,1)
abline(v = c1)
mu.std = 0
sigma.std = 1
x.std = seq(mu.std-4*sigma.std,mu.std+4*sigma.std,by=0.1)
y.std = dnorm(x.std, mean = 0, sd = 1)
plot(x.std,y.std,type="l")
c2 = qnorm(c(0.025,0.975),0,1)
abline(v=c2)
Use the stickleback data set.
stickleback <- read.csv(url("http://www.zoology.ubc.ca/~schluter/WhitlockSchluter/wp-content/data/chapter03/chap03e3SticklebackPlates.csv"))
I show the calculations for one example using base R functions:
plates.mm = stickleback$plates[stickleback$genotype=="mm"]
mean.mm = mean(plates.mm)
sd.mm = sd(plates.mm)
z.mm = (plates.mm - mean.mm)/(sd.mm)
hist(z.mm,xlim=c(-10,10),freq=F)
x.vals = seq(-10,10,by=0.01)
y.vals =dnorm(x.vals,0,1)
lines(x.vals,y.vals)
t-test confidence interval:
t.test(plates.mm)
##
## One Sample t-test
##
## data: plates.mm
## t = 30.685, df = 87, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 10.91451 12.42640
## sample estimates:
## mean of x
## 11.67045
2 SE approximation:
n = length(plates.mm)
CI = c(mean.mm-2*sd.mm/(sqrt(n)),mean.mm+2*sd.mm/(sqrt(n)))
print(CI)
## [1] 10.90980 12.43111
We find very little difference between the two CIs. The CI from the t.test function is slightly narrower, reflecting that for the given sample size the percentiles of the t-distribution are smaller than the value of 2 used in the approximation:
qt(0.975,df = n-1)
## [1] 1.987608
Assume that Z is a number randomly chosen from a standard normal distribution. Calculate each of the following probabilities :
P(Z > 1.34)
P(Z < 1.34)
P(Z < 2.15)
P(Z < 1.2)
P(0.52 < Z < 2.34)
P(-2.34 < Z < -0.52)
P(Z < -0.93)
P(-1.57 < Z < - 0.32)
Make a sketch for each case and calculate the values using R functions.
Remember that pnorm(x) = P(Z < x) and that 1 - pnorm(x) = P(Z > x).
x = seq(-4,4,by=0.01)
y = dnorm(x,0,1)
plot(x,y,type="l",xlab="Z",ylab="density")
abline(v = 1.34,col = "red")
text(2.3,0.05,"P(Z > 1.34)")
prob = 1 - pnorm(1.34,0,1)
prob
## [1] 0.09012267
plot(x,y,type="l",xlab="Z",ylab="density")
abline(v = 1.34,col = "red")
text(0,0.05,"P(Z < 1.34)")
prob = pnorm(1.34,0,1)
prob
## [1] 0.9098773
plot(x,y,type="l",xlab="Z",ylab="density")
abline(v = 2.15,col = "red")
text(0,0.05,"P(Z < 2.15)")
prob = pnorm(2.15,0,1)
prob
## [1] 0.9842224
plot(x,y,type="l",xlab="Z",ylab="density")
abline(v = 1.2,col = "red")
text(0,0.05,"P(Z < 1.2)")
prob = pnorm(1.2,0,1)
prob
## [1] 0.8849303
plot(x,y,type="l",xlab="Z",ylab="density")
abline(v = 2.34,col = "red")
abline(v = 0.52,col = "red")
text(1.5,0.3,"P(0.52 < Z < 2.34)")
prob = pnorm(2.34,0,1) - pnorm(0.52,0,1)
prob
## [1] 0.2918899
plot(x,y,type="l",xlab="Z",ylab="density")
abline(v = -2.34,col = "red")
abline(v = -0.52,col = "red")
text(-1.5,0.3,"P(-2.34 < Z < -0.52)")
prob = pnorm(-0.52,0,1) - pnorm(-2.34,0,1)
prob
## [1] 0.2918899
plot(x,y,type="l",xlab="Z",ylab="density")
abline(v = -0.93,col = "red")
text(-2,0.3,"P(Z < -0.93)")
prob = pnorm(-0.93,0,1)
prob
## [1] 0.1761855
plot(x,y,type="l",xlab="Z",ylab="density")
abline(v = -1.57,col = "red")
abline(v = - 0.32,col = "red")
text(-1.5,0.3,"P(-1.57 < Z < -0.32)")
prob = pnorm(-0.32,0,1) - pnorm(-1.57,0,1)
prob
## [1] 0.3162766
Use the data set “chap11e2Stalkies.csv” of eye span measurements from a sample of stalk-eyed flies:
stalkie = read.csv(url("http://www.zoology.ubc.ca/~schluter/WhitlockSchluter/wp-content/data/chapter11/chap11e2Stalkies.csv"))
mean.stalk = mean(stalkie$eyespan)
sd.stalk = sd(stalkie$eyespan)
z = (stalkie$eyespan - mean.stalk)/sd.stalk
hist(z,freq = F,xlim=c(-3,3))
x.val = seq(-3,3,by=0.01)
y.val = dnorm(x.val,0,1)
lines(x.val,y.val)
n = length(stalkie$eyespan)
SE = sd.stalk/sqrt(n)
mu = 9
t = (mean.stalk-mu)/SE
t.test(stalkie$eyespan-9)
##
## One Sample t-test
##
## data: stalkie$eyespan - 9
## t = -1.6738, df = 8, p-value = 0.1327
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## -0.52838420 0.08393976
## sample estimates:
## mean of x
## -0.2222222
y.val = dt(x.val,df = n-1)
plot(x.val,y.val,type = "l")
abline(v = t)
We use the function pt, which gives us the cumulative distribution function of the t-distribution. The degrees of freedom are given by the sample size minus 1. The cumulative distribution function gives us the probability P(T < t), so we can calculate:
prob.t = pt(t,df=n-1)+(1-pt(-t,df=n-1))
print(prob.t)
## [1] 0.1327105
As the world warms, the geographic ranges of species might shift toward cooler areas. Chen et al. (2011) studied recent changes in the highest elevation of 31 taxa, in meters, over the late 1900s and early 2000s. Positive numbers indicate upward shifts in elevation, and negative numbers indicate shifts to lower elevations. The values are given by:
data = c(58.9, 7.8, 108.6, 44.8, 11.1, 19.2, 61.9, 30.5, 12.7, 35.8, 7.4, 39.3, 24.0, 62.1, 24.3, 55.3, 32.7, 65.3, -19.3, 7.6, -5.2, -2.1, 31.0, 69.0, 88.6, 39.5, 20.7, 89.0, 69.0, 64.9, 64.8)
n = length(data)
print(paste("The sample size is ",n))
## [1] "The sample size is 31"
m = mean(data)
print(paste("The mean shift is ",m, "meters."))
## [1] "The mean shift is 39.3290322580645 meters."
sd = sd(data)
print(paste("The standard deviation of the shift is ",round(sd,digits=2)," meters."))
## [1] "The standard deviation of the shift is 30.66 meters."
SE = sd/sqrt(n)
print(paste("The standard error of the mean is ",round(SE,digits=2),"."))
## [1] "The standard error of the mean is 5.51 ."
We can either use results from the t-test function:
tt = t.test(data)
tt$conf.int
## [1] 28.08171 50.57635
## attr(,"conf.level")
## [1] 0.95
or
upper = m + 2*SE
lower = m - 2*SE
c(lower,upper)
## [1] 28.31452 50.34355
A more precise version:
upper = m + qt(0.975,df = n - 1)*SE
lower = m + qt(0.025,df = n - 1)*SE
c(lower,upper)
## [1] 28.08171 50.57635
\(H_0:\) The mean elevational shift is 0 meters. \(H_A:\) The mean elevational shift is not 0 meters.
You could use the mean elevational shift, or better the standardized version:
mu.0 = 0 # as specified in the null hypothesis
t.stat = (m - mu.0)/SE
t.stat
## [1] 7.141309
The standard normal distribution is a good approximation; even better is the t-distribution:
# we first calculate the null distribution of our t statistic
x = seq(-8,8,by=0.01)
y = dt(x,df = n-1)
# and plot it
plot(x,y,type="l",xlab="t",ylab="density")
# we add the observed t statistic
abline(v = t.stat,col="slategray")
We see in the figure that the observed t statistic is far away from the mode of the t distribution. The p value should therefore be very small. To calculate it, we calculate the area under the curve to the right of the statistic, and double this value to account for extreme events on the left end of the distribution (we perform a two-sided test). The function pt would give us the area to the left of the test statistic, so we use 2*(1-pt(t.stat,df = n-1)) to get the p-value:
2*(1-pt(t.stat,df = n-1))
## [1] 6.056689e-08
A four-year review at the provincial Hospital in Alotau, Papua New Guinea (Barss 1984), found that 1/40 of their hospital admissions were injuries due to falling coconuts. If coconuts weigh on average 3.2 kg, and the upper bound of the 95% CI is 3.5 kg, what is the lower bound of this CI? Assume a normal distribution of coconut weights.
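One way to approach this: a normal-based confidence interval is symmetric around the sample mean, so the lower bound lies as far below the mean as the upper bound lies above it. A minimal sketch of the arithmetic (numbers taken from the question):

```r
m <- 3.2                   # mean coconut weight in kg
upper <- 3.5               # upper bound of the 95% CI
lower <- m - (upper - m)   # the CI is symmetric around the mean
lower
## [1] 2.9
```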
Perform a goodness-of-fit test for a data set with two categories. The data file is “chap08e4XGeneContent.csv” and can be found on ilias or online:
geneContent <- read.csv(url("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter08/chap08e4XGeneContent.csv"))
Start by reading and inspecting the data. Each row is a different gene, with its occurrence on the X chromosome indicated. Then test if the proportion of genes on the X chromosome (among all genes in the genome) is the same as the proportional length of the X chromosome measured in base pairs (among the whole genome). The relative size of the X chromosome is given by \(propX = 1055/20290\) and that of the other chromosomes is \(propRest = 19235/20290\).
Plot a frequency table showing the number of genes on the X and on other chromosomes.
Perform a \(\chi^2\) goodness-of-fit test to the proportional model.
Draw a conclusion from your test.
#### Solution
geneContent <- read.csv(url("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter08/chap08e4XGeneContent.csv"))
propX = 1055/20290
propAut = 19235/20290
head(geneContent)
## chromosome
## 1 X
## 2 X
## 3 X
## 4 X
## 5 X
## 6 X
geneContent$chromosome <- factor(geneContent$chromosome, levels = c("X","NotX"))
geneContentTable <- table(geneContent$chromosome)
barplot(geneContentTable)
ggplot(geneContent) +
geom_bar(mapping = aes(x = chromosome))
chisq.test( geneContentTable, p = c(propX,propAut))
##
## Chi-squared test for given probabilities
##
## data: geneContentTable
## X-squared = 75.065, df = 1, p-value < 2.2e-16
Interpretation:
Our test shows a very small \(p\)-value, indicating that the null hypothesis
\(H_0:\) The proportion of genes on the X chromosome is the same as the proportion of base pairs on the X chromosome.
can be rejected in favor of the alternative hypothesis
\(H_A:\) The proportion of genes on the X chromosome is not the same as the proportion of base pairs on the X chromosome.
Load the dataset chap08e1DayOfBirth.csv:
birthDay <- read.csv(url("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter08/chap08e1DayOfBirth.csv"))
head(birthDay)
## day
## 1 Sunday
## 2 Sunday
## 3 Sunday
## 4 Sunday
## 5 Sunday
## 6 Sunday
Each row represents a single birth, and shows the day of the week of birth. Test the null hypothesis that birthdays are randomly distributed in the sense that each day occurs with the same probability. Visualize your results and give an interpretation of the test results. Can you come up with a potential explanation?
birthDay <- read.csv(url("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter08/chap08e1DayOfBirth.csv"))
head(birthDay)
## day
## 1 Sunday
## 2 Sunday
## 3 Sunday
## 4 Sunday
## 5 Sunday
## 6 Sunday
Put the days of the week in the correct order for tables and graphs.
birthDay$day <- factor(birthDay$day, levels = c("Sunday", "Monday",
"Tuesday", "Wednesday","Thursday","Friday","Saturday"))
birthDayTable <- table(birthDay$day)
data.frame(Frequency = addmargins(birthDayTable))
## Frequency.Var1 Frequency.Freq
## 1 Sunday 33
## 2 Monday 41
## 3 Tuesday 63
## 4 Wednesday 63
## 5 Thursday 47
## 6 Friday 56
## 7 Saturday 47
## 8 Sum 350
Bar graph of the birth data. The argument cex.names = 0.8 shrinks the names of the weekdays to 80% of the default size so that they fit in the graphics window – otherwise one or more names may be dropped.
barplot(birthDayTable, cex.names = 0.8)
shortNames = substr(names(birthDayTable), 1, 3)
barplot(table(birthDay$day), names = shortNames,
ylab = "Frequency", las = 1, col = "firebrick")
\(\chi^2\) goodness-of-fit test. The vector p is the expected proportions rather than the expected frequencies, and they must sum to 1 (R nevertheless uses the expected frequencies when calculating the test statistic).
chisq.test(birthDayTable, p = rep(1/7,times=7))
##
## Chi-squared test for given probabilities
##
## data: birthDayTable
## X-squared = 15.24, df = 6, p-value = 0.01847
We find a small \(p\)-value (&lt; 0.05), so we can reject the null hypothesis
\(H_0:\) Birthdays are uniformly distributed across weekdays.
in favor of the alternative hypothesis
\(H_A:\) Birthdays are not uniformly distributed across weekdays.
From inspecting the data, it appears as if births are less common on Sundays. It could be that scheduled caesarean deliveries are not scheduled on weekends. Another reason could be that hospitals are sometimes understaffed on weekends, so there might be a tendency to wait a bit longer before inducing labor in some non-critical cases.
Note, however, that the p-value is not extremely small. We would reject the null hypothesis if we chose \(\alpha = 0.05\) but not if we chose \(\alpha = 0.01\). This illustrates that the decision to reject or not reject a null hypothesis is strongly influenced by the degree of certainty you want to have in your decision.
We continue the “summary exercise” from week 8: As the world warms, the geographic ranges of species might shift toward cooler areas. Chen et al. (2011) studied recent changes in the highest elevation of 31 taxa, in meters, over the late 1900s and early 2000s. Positive numbers indicate upward shifts in elevation, and negative numbers indicate shifts to lower elevations. The values are given by:
data = c(58.9, 7.8, 108.6, 44.8, 11.1, 19.2, 61.9, 30.5, 12.7, 35.8, 7.4, 39.3, 24.0, 62.1, 24.3, 55.3, 32.7, 65.3, -19.3, 7.6, -5.2, -2.1, 31.0, 69.0, 88.6, 39.5, 20.7, 89.0, 69.0, 64.9, 64.8)
The most accurate way is to use the t-test again:
tt$p.value
## [1] 6.056689e-08
We can also do it by hand. For this it is good to make a sketch first:
# we first calculate the null distribution of our t statistic
x = seq(-8,8,by=0.01)
y = dt(x,df = n-1)
# and plot it
plot(x,y,type="l",xlab="t",ylab="density")
# we add the observed t statistic
abline(v = t.stat,col="slategray")
We see in the figure that the observed t statistic is far away from the mode of the t distribution. The p value should therefore be very small. To calculate it, we calculate the area under the curve to the right of the statistic, and double this value to account for extreme events on the left end of the distribution (we perform a two-sided test). The function pt would give us the area to the left of the test statistic, so we use 2*(1-pt(t.stat,df = n-1)) to get the p-value:
2*(1-pt(t.stat,df = n-1))
## [1] 6.056689e-08
We choose a significance threshold of 5 percent, i.e., we want to be 95 percent certain that we do not have a false positive. The p-value is clearly smaller than that. Thus, the t-test shows that there is a statistically significant change in the highest elevation. Because the p-value is so small, we can be very certain of this result, as it would also hold at much lower significance thresholds.
Use the stickleback data set yet again. Perform a t.test for each comparison of plate numbers between genotypes. Compare the outcome of the t.tests with the results from the confidence intervals. What do you find?
t.test(stickleback$plates[stickleback$genotype=="mm"],
stickleback$plates[stickleback$genotype=="Mm"])
##
## Welch Two Sample t-test
##
## data: stickleback$plates[stickleback$genotype == "mm"] and stickleback$plates[stickleback$genotype == "Mm"]
## t = -32.001, df = 208.06, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -41.09355 -36.32416
## sample estimates:
## mean of x mean of y
## 11.67045 50.37931
t.test(stickleback$plates[stickleback$genotype=="mm"],
stickleback$plates[stickleback$genotype=="MM"])
##
## Welch Two Sample t-test
##
## data: stickleback$plates[stickleback$genotype == "mm"] and stickleback$plates[stickleback$genotype == "MM"]
## t = -95.49, df = 167.89, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -52.16670 -50.05336
## sample estimates:
## mean of x mean of y
## 11.67045 62.78049
t.test(stickleback$plates[stickleback$genotype=="Mm"],
stickleback$plates[stickleback$genotype=="MM"])
##
## Welch Two Sample t-test
##
## data: stickleback$plates[stickleback$genotype == "Mm"] and stickleback$plates[stickleback$genotype == "MM"]
## t = -10.262, df = 207.1, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -14.78364 -10.01871
## sample estimates:
## mean of x mean of y
## 50.37931 62.78049
We find that all t-tests yield a significant result (small p-values). This means that the null hypothesis that the means of two groups are the same can be rejected for all comparisons.
A shorter way is to use the function pairwise.t.test. The first argument is the variable on which the t-tests should be performed. The second argument is the grouping factor: the categorical variable that separates the first argument into different groups.
pairwise.t.test(stickleback$plates,stickleback$genotype)
##
## Pairwise comparisons using t tests with pooled SD
##
## data: stickleback$plates and stickleback$genotype
##
## mm Mm
## Mm < 2e-16 -
## MM < 2e-16 1.5e-15
##
## P value adjustment method: holm
How does this finding relate to the confidence intervals? The confidence intervals are not overlapping, strongly suggesting that the means between groups are indeed different (at the 95 percent level). Thus, the t-test and the confidence interval indicate the same result.
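If you want to check the overlap yourself, the per-group 95% confidence intervals can be computed directly. A quick sketch, assuming the stickleback data used above is loaded:

```r
# one-sample t-tests per genotype; $conf.int extracts the 95% CI of each group mean
t.test(stickleback$plates[stickleback$genotype == "mm"])$conf.int
t.test(stickleback$plates[stickleback$genotype == "Mm"])$conf.int
t.test(stickleback$plates[stickleback$genotype == "MM"])$conf.int
```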
A forensic pathologist wants to know whether there is a difference between the rate of cooling of freshly killed bodies and those which were reheated, to determine whether you can detect an attempt to mislead a coroner about time of death. He tested several mice for their “cooling constant” both when the mouse was originally killed and then after the mouse was re-heated. The data can be found in the file “mouse_cooling.csv” on ilias. Test whether there is any difference in the cooling constants between freshly killed and reheated corpses. Check whether you can use a t-test and justify your decision to use it. If a t-test is not justified, try to use an alternative strategy (e.g., non-parametric test).
library(tidyverse)
mouse_cooling <- read.csv("~/Dropbox/Teaching/StatisticsForBiology/mouse_cooling.csv")
mouse_cooling = mutate(mouse_cooling,diff = freshly.killed - reheated)
ggplot(mouse_cooling) +
geom_histogram(aes(x = diff),bins=7)
The data is clearly right skewed. This can also be seen with a QQ plot:
library(car)
qqPlot(mouse_cooling$diff)
## [1] 15 2
The null hypothesis is
\(H_0:\) The true mean of the differences is \(\mu_0 = 0\).
We try the transformation \(x' = \log_{100+x}(100)\), written in R as log(100, 100 + x) (R's log takes the value first and the base second). Note that we add the constant 100 so that the base is always positive. The transformed value of our null hypothesis is \(\mu_0' = 1\), because \(\mu_0 = 0\) gives \(x' = \log_{100}(100) = 1\).
library(car)
qqPlot(log(100,100+mouse_cooling$diff))
## [1] 15 18
ggplot(mouse_cooling) +
geom_histogram(aes(x = log(100,100+diff)),bins=7)
This looks much better now and we perform a t-test on these data:
results = t.test(log(100,100+mouse_cooling$diff),mu = 1)
results
##
## One Sample t-test
##
## data: log(100, 100 + mouse_cooling$diff)
## t = -1.4466, df = 18, p-value = 0.1652
## alternative hypothesis: true mean is not equal to 1
## 95 percent confidence interval:
## 0.9254516 1.0137511
## sample estimates:
## mean of x
## 0.9696013
An alternative would be the sign test:
k = sum(mouse_cooling$diff<0)
results.sign = binom.test(k,n=19,p=0.5)
results.sign
##
## Exact binomial test
##
## data: k and 19
## number of successes = 7, number of trials = 19, p-value = 0.3593
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
## 0.1628859 0.6164221
## sample estimates:
## probability of success
## 0.3684211
In both cases we find no significant deviation from 0 in the differences. We thus cannot reject the null hypothesis.
For each of the following scenarios, identify the best statistical test to use and state the null hypothesis. (Please note, do not give the answer to the specific question, but simply state the best test to use and the null hypothesis for the scenario.)
Asking whether stickleback fish occur with equal probability through all areas of a pond. We have taken random samples at various locations of the pond (all the same size) and have recorded the number of fish in each sampled location.
We want to know which of two tumor treatments showed better results. We recorded the change in tumor size during the treatment in two groups of patients, group 1 was treated with drug A and group 2 with drug B.
We want to know whether a new tumor treatment showed good results. We recorded the change in tumor size before and after the treatment in each patient.
Asking whether the number of smokers in a list of cities is proportional to the population size of those cities.
Testing whether patients' body mass changes during a hospital stay.
For each of the following scenarios, identify the best statistical test to use and state the null hypothesis. (Please note, do not give the answer to the specific question, but simply state the best test to use and the null hypothesis for the scenario.)
\(\chi^2\) goodness-of-fit test. H0: Fish numbers per unit area follow a Poisson distribution.
Two-sample t-test or a non-parametric alternative. H0: There is no difference in the mean tumor size change between the two groups.
One-sample t-test or a non-parametric alternative. H0: The mean tumor size change during the treatment is 0.
\(\chi^2\) goodness-of-fit test. H0: The number of smokers in a city is proportional to the number of people in that city.
One-sample t-test. H0: Mean change in patient body mass during a hospital stay is zero.
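For the first scenario, such a goodness-of-fit test to a Poisson distribution could be sketched as follows. The counts below are made up for illustration; with real data you would also pool classes with small expected frequencies:

```r
counts <- c(0, 2, 1, 3, 0, 1, 2, 4, 1, 0, 2, 1)  # hypothetical fish counts per location
lambda.hat <- mean(counts)                        # estimate the Poisson mean from the data
obs <- table(factor(counts, levels = 0:4))        # observed frequencies of 0,1,2,3,>=4 fish
expProp <- dpois(0:4, lambda.hat)
expProp[5] <- 1 - sum(expProp[1:4])               # lump everything >= 4 into the last class
chisq.test(obs, p = expProp)
# note: because lambda was estimated from the data, one additional degree of
# freedom should be subtracted when judging significance (df = classes - 1 - 1)
```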
Health spending per person from a random sample of 20 countries is given in the file "health_expenditure.csv" on ilias. We will use this sample to estimate the mean of log health expenditure, including a confidence interval.
Visualize the frequency distribution using a histogram. Does the data deviate from a normal distribution? If yes, how?
Why is a log transformation a good choice for these data?
What is the sample size?
Calculate the natural logarithm of each data point.
What is the mean log health expenditure?
What is the standard deviation of the log health expenditure?
Calculate the standard error of the mean log health expenditure.
Calculate the 95% confidence interval of the mean log health expenditure.
What are the limits of this confidence interval expressed on original (non-log) scale?
health.expenditure <- read.csv("~/Dropbox/Teaching/StatisticsForBiology/health_expenditure.csv")
ggplot(health.expenditure) +
geom_histogram(aes(x = Per.capita.health.expenditure.in.2010),bins=10)
Because the data is highly right skewed.
The sample size is
n = dim(health.expenditure)[1]
n
## [1] 18
data = mutate(health.expenditure,log.expend= log(Per.capita.health.expenditure.in.2010))
data$log.expend
## [1] 6.761573 5.768321 5.476464 6.782192 6.156979 4.276666 4.094345 6.408529
## [9] 3.850148 5.192957 5.509388 4.691348 7.436617 4.997212 5.888878 4.343805
## [17] 7.305860 6.522093
ggplot(data) +
geom_histogram(aes(x = log.expend),bins=4)
qqPlot(data$log.expend)
## [1] 13 9
The mean is
M =mean(data$log.expend)
M
## [1] 5.636854
The standard deviation is
SD = sd(data$log.expend)
SD
## [1] 1.110587
The standard error is
SE = SD/sqrt(n)
SE
## [1] 0.2617679
CI.log = t.test(data$log.expend)$conf.int
CI.log
## [1] 5.084572 6.189136
## attr(,"conf.level")
## [1] 0.95
or alternatively we can approximate it by:
CI.log.app = c(M-2*SE,M+2*SE)
CI.log.app
## [1] 5.113318 6.160390
CI.nonlog = exp(CI.log)
CI.nonlog
## [1] 161.5108 487.4248
## attr(,"conf.level")
## [1] 0.95
M.nonlog = mean(data$Per.capita.health.expenditure.in.2010)
hist(data$Per.capita.health.expenditure.in.2010,col="firebrick",main="",breaks=10)
abline(v = CI.nonlog,lty=2)
abline(v = M.nonlog,lwd=2)
Note how asymmetric the back-transformed CI is - this reflects the asymmetry of the data. Be careful, however, as this is an approximation! (Note: the mean of log-transformed data is not the same as the logarithm of the mean of the data.)
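To see the difference between the two, consider a small made-up example: exponentiating the mean of the logs yields the geometric mean, which is generally smaller than the arithmetic mean for right-skewed data:

```r
x <- c(1, 10, 100)      # made-up, strongly right-skewed data
mean(x)                 # arithmetic mean: 37
exp(mean(log(x)))       # geometric mean: 10
```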
Recycling paper has some obvious benefits, but it may have unintended consequences. For example, perhaps people are less careful about how much paper they use if they know their waste will be recycled. Catlin and Wang (2013) tested this idea by measuring paper use in two groups of experimental participants. Each person was placed in a room alone with scissors, paper and a trash can, and was told that he or she was testing the scissors. In the “recycling” group only, there was also a recycling bin in the room. The amount of paper used by each participant was measured in grams.
No recycling bin: 4,4,4,4,4,4,4,5,8,9,9,9,9,12,12,13,14,14,14,14,15,23
With recycling bin: 4,5,8,8,8,9,9,9,12,14,14,15,16,19,23,28,40,43,129,130
Make and examine a histogram of these data. Are the frequency distributions of the two groups similar in shape and spread?
Based on (a), discuss your options for testing a difference between the groups.
Apply a Mann-Whitney U-test (R function wilcox.test). State the null and alternative hypotheses.
Calculate the P-value as accurately as you can and interpret the result.
no.recycling = c(4,4,4,4,4,4,4,5,8,9,9,9,9,12,12,13,14,14,14,14,15,23)
recycling = c(4,5,8,8,8,9,9,9,12,14,14,15,16,19,23,28,40,43,129,130)
n.rec = length(recycling)
n.no.rec = length(no.recycling)
group = rep(c("Recycling","No.Recycling"),times=c(n.rec,n.no.rec))
dat = data.frame(var = c(recycling,no.recycling),group = group)
ggplot(dat) +
geom_histogram(aes(x = var,fill=group))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(dat) +
geom_histogram(aes(x = var,fill = group),bins=12) +
facet_wrap(~group,nrow=2)
Both shape and spread seem to be different here, although it is difficult to tell given the small sample sizes. A normal distribution can, however, be ruled out, especially for the recycling group.
ggplot(dat) +
geom_qq(aes(sample = var,col=group)) +
geom_qq_line(aes(sample = var,col=group))+
facet_wrap(~group,nrow=2)
qqPlot(recycling)
## [1] 20 19
qqPlot(no.recycling)
## [1] 22 21
c and d)
\(H_0:\) It is equally likely that a randomly selected value from one group will be less than or greater than a randomly selected value from a second population. \(H_A:\) It is not equally likely that a randomly selected value from one group will be less than or greater than a randomly selected value from a second population.
wilcox.test(var~group,data=dat)
## Warning in wilcox.test.default(x = c(4, 4, 4, 4, 4, 4, 4, 5, 8, 9, 9, 9, :
## cannot compute exact p-value with ties
##
## Wilcoxon rank sum test with continuity correction
##
## data: var by group
## W = 126.5, p-value = 0.01825
## alternative hypothesis: true location shift is not equal to 0
# alternative without dataframes
wilcox.test(recycling,no.recycling)
## Warning in wilcox.test.default(recycling, no.recycling): cannot compute exact p-
## value with ties
##
## Wilcoxon rank sum test with continuity correction
##
## data: recycling and no.recycling
## W = 313.5, p-value = 0.01825
## alternative hypothesis: true location shift is not equal to 0
For comparison, a t-test in this case would yield the following (note that the t-test uses a slightly different null hypothesis):
t.test(var~group,data=dat)
##
## Welch Two Sample t-test
##
## data: var by group
## t = -2.1447, df = 19.681, p-value = 0.04465
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -34.9245559 -0.4663532
## sample estimates:
## mean in group No.Recycling mean in group Recycling
## 9.454545 27.150000
Both tests give similar results, which is a good sign. We therefore reject the null hypothesis. A randomly selected participant from the recycling group is likely to have used more paper than a randomly selected participant from the no-recycling group.
Use the data from exercise 26 and perform a permutation test on the data. Visualize the empirical sampling distribution and calculate a P-value. (Look up the example of a permutation test in the notes above.) Compare it to the results from exercise 26.
Alternatively, and more easily, take a look at the package perm and its function permTS, which automatically performs a permutation test for you.
Hints: Calculate the difference of the means:
no.recycling = c(4,4,4,4,4,4,4,5,8,9,9,9,9,12,12,13,14,14,14,14,15,23)
recycling = c(4,5,8,8,8,9,9,9,12,14,14,15,16,19,23,28,40,43,129,130)
n.rec = length(recycling)
n.no.rec = length(no.recycling)
group = rep(c("Recycling","No.Recycling"),times=c(n.rec,n.no.rec))
dat = data.frame(var = c(recycling,no.recycling),group = group)
MeanR = mean(recycling)
MeanNR = mean(no.recycling)
diffMeans = MeanR - MeanNR
Calculate difference between means for randomly created groups:
nPerm = 1000
permResult <- numeric(nPerm) # one slot per permutation replicate
permSample <- sample(dat$group, replace = FALSE)
permMeanR <-mean(dat$var[permSample=="Recycling"])
permMeanNR <-mean(dat$var[permSample=="No.Recycling"])
permResult[1] <- permMeanR - permMeanNR
Now use a loop to calculate many replicates of the random groupings.
We first calculate the actual difference of the means:
no.recycling = c(4,4,4,4,4,4,4,5,8,9,9,9,9,12,12,13,14,14,14,14,15,23)
recycling = c(4,5,8,8,8,9,9,9,12,14,14,15,16,19,23,28,40,43,129,130)
n.rec = length(recycling)
n.no.rec = length(no.recycling)
group = rep(c("Recycling","No.Recycling"),times=c(n.rec,n.no.rec))
dat = data.frame(var = c(recycling,no.recycling),group = group)
MeanR = mean(recycling)
MeanNR = mean(no.recycling)
diffMeans = MeanR - MeanNR
nPerm = 1000
permResult <- vector() # initializes
for(i in 1:nPerm){
# step 1: permute the group labels
permSample <- sample(dat$group, replace = FALSE)
# step 2: calculate difference between means
permMeanR <-mean(dat$var[permSample=="Recycling"])
permMeanNR <-mean(dat$var[permSample=="No.Recycling"])
permResult[i] <- permMeanR - permMeanNR
}
Plot null distribution based on the permuted differences.
hist(permResult, right = FALSE, breaks = 100,col="darkred")
Note how odd the distribution looks. This is due to the small sample sizes: \(n_1 = 20, n_2 = 22\). More precisely, it is problematic because there are a few extreme values in the data set (129 and 130 in the recycling group). These extreme points have a strong impact on the empirical null distribution: if both land in the same resampled group, they shift the whole distribution strongly to the left or right. This causes the bimodal distribution. This is a sign that the sample size might be too small for a permutation test. However, usually this leads to small power and large p-values. So we can nevertheless use the null distribution to calculate an approximate P-value; we just need to be aware that our results will be very conservative. The p-value is twice the proportion of the permuted differences that are at least as extreme as the observed difference in means, diffMeans. The following code calculates the number of permuted differences at or above diffMeans.
sum(as.numeric(permResult >= diffMeans))
## [1] 0
These commands obtain the fraction of permuted differences at or above diffMeans.
sum(as.numeric(permResult >= diffMeans)) / nPerm
## [1] 0
Finally, multiply by 2 to get the P-value for a two-sided test.
2 * ( sum(as.numeric(permResult > diffMeans)) / nPerm )
## [1] 0
We can illustrate the permuted differences and add the observed difference to show the outcome of our test: the bright red bars indicate the proportion of permutation results that yielded a more extreme result than the one observed.
hist(permResult[permResult<=diffMeans],breaks=seq(-20,20,by=1),col="darkred",main="")
hist(permResult[permResult>diffMeans],breaks=seq(-20,20,by=1),add=T,col="red")
abline(v = diffMeans,col="red",lwd=2)
An alternative approach is
library(perm)
permTS(dat$var[dat$group=="Recycling"],dat$var[dat$group=="No.Recycling"],exact = T)
##
## Exact Permutation Test Estimated by Monte Carlo
##
## data: GROUP 1 and GROUP 2
## p-value = 0.004
## alternative hypothesis: true mean GROUP 1 - mean GROUP 2 is not equal to 0
## sample estimates:
## mean GROUP 1 - mean GROUP 2
## 17.69545
##
## p-value estimated from 999 Monte Carlo replications
## 99 percent confidence interval on p-value:
## 1.003509e-05 1.482735e-02
We again find that we can reject the null hypothesis.
Many biological processes display circadian rhythms in activity, which presumably operate to coordinate cellular functions with daily environmental oscillations. Many tissues themselves exhibit circadian rhythms of activity to optimize specific processes which require coordination with the light-dark cycle. We want to investigate the phase shift in the circadian rhythm of melatonin production in participants given alternative light treatments. Light treatments were applied to the knees or the eyes; a control group received no treatment. The phase shift in the circadian rhythm of melatonin production was recorded. You can download the data with this command (original publication DOI: 10.1126/science.1071697):
circadian <- read.csv(url("https://whitlockschluter3e.zoology.ubc.ca/Data/chapter15/chap15e1KneesWhoSayNight.csv"), stringsAsFactors = FALSE)
Visualize the data.
Test whether any of the treatments lead to change in the phase shift. Assume that the data is normally distributed (ANOVA!).
If the ANOVA tells you that one of the groups is significantly different from the others, find out which one and discuss the results.
The original goal of this publication was to replicate findings from an earlier study that showed that light treatment behind the knee can alter your circadian rhythm of melatonin production. The authors replicated the experiment using the same setup and machines, but improved the study design by including a proper control group and using a double-blind experimental design. What did the new study find? Do the new results support the original findings? What can we learn from this?
I first change the order of the groups for better comparison:
circadian$treatment <- factor(circadian$treatment,
levels = c("control", "knee", "eyes"))
The data can be visualized in several different ways and I chose a simple strip chart. This looks nice and has the advantage that we can actually see the raw data.
ggplot(circadian, aes(x = treatment, y = shift)) +
geom_point(color = "firebrick", size = 3, shape = 1) +
labs(x = "Light treatment", y = "Shift in circadian rhythm (h)") +
theme_classic()
If you want to also include error bars you have already seen many ways to do this, so I use a new one that is relatively elegant:
ggplot(circadian, aes(x = treatment, y = shift)) +
geom_point(color = "firebrick", size = 3, shape = 1) +
stat_summary(fun.data = mean_se, geom = "errorbar",
colour = "black", width = 0.1,
position=position_nudge(x = 0.15)) +
stat_summary(fun.y = mean, geom = "point",
colour = "firebrick", size = 3,
position=position_nudge(x = 0.15)) +
labs(x = "Light treatment", y = "Shift in circadian rhythm (h)") +
theme_classic()
## Warning: `fun.y` is deprecated. Use `fun` instead.
The ANOVA table is made in two steps. The first step involves fitting the ANOVA model to the data using lm() (“lm” stands for “linear model”, of which ANOVA is one type). Then we use the command anova() to assemble the table.
circadianAnova <- lm(shift ~ treatment, data = circadian)
anova(circadianAnova)
## Analysis of Variance Table
##
## Response: shift
## Df Sum Sq Mean Sq F value Pr(>F)
## treatment 2 7.2245 3.6122 7.2894 0.004472 **
## Residuals 19 9.4153 0.4955
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In the ANOVA table we see that the F statistic is quite large and the P-value is small. That means we can reject the null hypothesis that all means are equal at any significance threshold of \(0.005\) or higher. We can thus be fairly certain that one of the three groups is different from the others.
pairwise.t.test(circadian$shift,circadian$treatment)
##
## Pairwise comparisons using t tests with pooled SD
##
## data: circadian$shift and circadian$treatment
##
## control knee
## knee 0.9418 -
## eyes 0.0088 0.0088
##
## P value adjustment method: holm
Note that these p-values are automatically adjusted for multiple testing!
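If you want to see the effect of the adjustment, pairwise.t.test takes a p.adjust.method argument (the default is "holm"); setting it to "none" shows the raw p-values for comparison:

```r
# raw p-values without correction for multiple testing (for comparison only)
pairwise.t.test(circadian$shift, circadian$treatment, p.adjust.method = "none")
```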
lion <- read.csv(url("http://www.zoology.ubc.ca/~schluter/WhitlockSchluter/wp-content/data/chapter17/chap17e1LionNoses.csv"))
head(lion)
## proportionBlack ageInYears
## 1 0.21 1.1
## 2 0.14 1.5
## 3 0.11 1.9
## 4 0.13 2.2
## 5 0.12 2.6
## 6 0.13 3.2
The data set consists of two measurements: the age of each lion in years and the proportion of black pigmentation on its nose.
Visualize and inspect the data
Calculate the correlation coefficient of the two variables.
Estimate a linear regression model using the data. Inspect the output and write down the model.
Check the assumptions of the linear regression.
Calculate a confidence interval for the slope of the regression line.
Test if the slope of the model is significantly different from 0.
Plot the data and add a regression line and a confidence band for the regression line.
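If you get stuck, the overall workflow might look like the following sketch (the model direction, age as a function of nose pigmentation, is one reasonable choice; interpretation of the output is left to you):

```r
plot(lion$proportionBlack, lion$ageInYears)                 # a) visualize the raw data
cor(lion$proportionBlack, lion$ageInYears)                  # b) correlation coefficient
lionModel <- lm(ageInYears ~ proportionBlack, data = lion)  # c) fit the linear model
summary(lionModel)                                          # model output, incl. t-test of the slope (f)
plot(lionModel)                                             # d) diagnostic plots to check the assumptions
confint(lionModel)                                          # e) 95% CIs for intercept and slope
# g) scatter plot with regression line and confidence band
ggplot(lion, aes(x = proportionBlack, y = ageInYears)) +
  geom_point() +
  geom_smooth(method = "lm")
```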